Analytic global OSCE ratings are sensitive to level of training

Brian Hodges1 & Jodi Herold McIlroy2

Purpose There are several reasons for using global ratings in addition to checklists for scoring objective structured clinical examination (OSCE) stations. However, there has been little evidence collected regarding the validity of these scales. This study assessed the construct validity of an analytic global rating with 4 component subscales: empathy, coherence, verbal and non-verbal expression.

Methods A total of 19 Year 3 and 38 Year 4 clinical clerks

were scored on content checklists and these global

ratings during a 10-station OSCE. T-tests were used to

assess differences between groups for overall checklist

and global scores, and for each of the 4 subscales.

Results The mean global rating was significantly higher for senior clerks (75.5% versus 71.3%, t(55) = 2.12, P < 0.05) and there were significant differences by level of training for the coherence (t(55) = 3.33, P < 0.01) and verbal communication (t(55) = 2.33, P < 0.05) subscales. Interstation reliability was 0.70 for the global rating and ranged from 0.58 to 0.65 for the subscales. Checklist reliability was 0.54.

Conclusion In this study, a summated analytic global rating demonstrated construct validity, as did 2 of the 4 scales measuring specific traits. In addition, the analytic global rating showed substantially higher internal consistency than did the checklists, a finding consistent with that seen in previous studies cited in the literature. Global ratings are an important element of OSCE measurement and can have good psychometric properties. However, OSCE researchers should clearly describe the type of global ratings they use. Further research is needed to define the most effective global rating scales.

Keywords education, medical/*standards; clinical clerkships/*standards; *educational measurement; reproducibility of results; psychometrics/methods.

Medical Education 2003;37:1012–1016

Introduction

The assessment of clinical competence in medical

education has been greatly advanced over the past 2

decades by the widespread use of objective structured

clinical examinations (OSCEs). The OSCE usually

combines the reality of live clinical interactions with the

standardisation of problems and the use of multiple

observers.1 Typically, an OSCE consists of a series of

independently scored stations, each of which involves a

standardised patient (SP) who portrays a clinical

scenario. Scoring at each station is performed either

by an examiner who ticks off items on a binary checklist

while observing the scenario from within the room or

through a 1-way mirror, or by the SP at the end of the

encounter. Following each encounter, the examiner or

SP may also complete 1 or more global ratings

regarding general dimensions of performance. In the

vast majority of OSCEs, binary checklists account for

the bulk of the marks awarded to candidates.

Recently, however, several authors have advanced

arguments for an increased use of global ratings over

the traditional reliance on checklists. There are several

reasons for these proposals. Firstly, global ratings

appear to have psychometric properties (including

interstation reliability or internal consistency, concurrent validity and construct validity) that are at least as

good as, and often better than, those of checklists.1–5

Secondly, there is a growing body of literature that

suggests that clinicians with higher levels of expertise

are highly focused in gathering data and may be

penalised by checklists that reward thoroughness.6–8

Thirdly, the checklists used in most OSCEs use binary (yes/no) ratings that tend to neglect the higher components of clinical competence, such as empathy9 and the organisation of knowledge, in favour of simple, linear accumulation of facts.10 Thus there are several reasons to utilise some form of global rating in addition to binary content checklists.

1 Wilson Centre for Research in Education, University of Toronto, Ontario, Canada

2 Department of Psychiatry and Physical Therapy, University of Toronto, Ontario, Canada

Correspondence: Dr Brian Hodges, Director, University of Toronto Wilson Centre for Research in Education, Toronto General Hospital, 1 Eaton / 200 Elizabeth Street, Toronto, Ontario M5G 2C4, Canada. Tel.: 00 1 416 340 3646; Fax: 00 1 416 340 3792; E-mail: [email protected]

Original article

© Blackwell Publishing Ltd MEDICAL EDUCATION 2003;37:1012–1016

However, there appears to be great variation in the literature as to what constitutes a ‘global’ rating scale, which, to name but a few of its nominal definitions, is sometimes described as a ‘process scale’, ‘communication scale’ or ‘professional behaviour scale’. Not only is there a range of domains assessed by the large variety of non-binary scales used, there is also great variability in their structure. Elsewhere in the educational research literature, scoring systems have been more carefully categorised on a continuum ranging from the most holistic (or ‘global’) to the most analytic. Hunter and colleagues11 have defined 5 broad categories along this continuum, ranging from the most to the least holistic: general impression scoring > holistic scoring > primary trait scoring > analytic scoring > atomistic scoring. If we adapt this classification to OSCE ratings, an example of general impression scoring is the final subjective pass/fail judgement that is sometimes made by an examiner at the end of an OSCE station. Further along the continuum, holistic scoring comprises 1 final global judgement based on a set of predefined global descriptors, or on a process of examiner training that is aimed at increasing the consistency of scoring. Primary trait scoring involves a set of more precisely defined performance criteria, including identification of the specific elements of an effective performance. Because it is a holistic method, however, there is an assumption that the defined traits are elements of 1 unified performance construct and that ‘the whole produces the parts’.11 Analytic scoring, in contrast, proceeds from the assumption that it is the observable and measurable components of performance that produce the whole and not the reverse. Thus, the scoring system breaks the performance into component parts that are scored individually first and then summed to generate an overall performance score.11 Finally, at the end of the continuum, checklists represent ‘atomistic’ scoring because they involve the counting of small elements of performance in an effort to ‘equate scoring with taking inventory’.11

In the absence of the widespread application of such

a classification system, it becomes difficult to make

comparisons between scales (which themselves are

often omitted from published articles), as authors are

often not explicit about the structure and scoring of

their global ratings. Further, while conclusions are

frequently made about the psychometric superiority of

global ratings, such statements have for the most part

referred to interstation reliability only. To date, somewhat less attention has been paid to the validity of

OSCE global rating scales. This study examined the

construct validity of a specific set of analytic global

ratings that are included in this report. While there are

many ways to explore construct validity, in this study

we sought evidence that global rating scales, both

individually and in aggregate, could differentiate

between students at different levels of training. The

domains (traits) assessed by these behaviourally

anchored ratings were:

1 response to the patient’s feelings and needs;

2 degree of coherence in the interview;

3 verbal expression, and

4 non-verbal expression.

Method

All 350 medical students in the final 2 years of

the 4-year medical programme at the University of

Toronto, Ontario, Canada were offered an opportunity

to participate in the study. The first 60 respondents

were signed up for participation. Of these, 21 were in

Year 3 and 39 were in Year 4 of the programme,

representing 12% and 22% of their classes, respectively.

Complete data were gathered on 57 subjects, 19 in

Year 3 and 38 in Year 4 (3 subjects did not take the

examination due to illness). All subjects were informed

of the voluntary nature of the examination and told that

their scores would be kept confidential. They were paid

a CAN$50.00 honorarium for their participation.

Key learning points

Checklists currently used in OSCEs are not always sensitive to higher levels of expertise.

Analytic global ratings can be used to augment OSCE checklists and can have good psychometric properties, including construct validity.

The term ‘global rating’ can mean many things and researchers should be careful to define the type of ratings used in their research and provide examples.

Each individual participated in ten 10-minute OSCE stations. Doctor examiners recruited from the University of Toronto evaluated candidate performance at all stations. One examiner rated student performance in each station, using 2 forms. Firstly, a task-specific binary checklist that was unique to each station was scored as ‘done’ or ‘not done/done incorrectly’ and summed for a total score. Secondly, a series of four 5-point global ratings was completed to rate the candidate’s performance in the 4 domains described above: empathy, coherence, verbal communication and non-verbal expression (see Appendix 1 for the global rating form). Scores on the 4 individual, global scales were summed to create a ‘summated global rating’. Examiners were blinded as to whether the students they were observing were in Years 3 or 4.
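The two scoring forms described above can be illustrated with a minimal sketch. It assumes that each of the 4 subscales is scored from 1 to 5 (as on the Appendix 1 form) and that a percentage score is the raw total divided by the maximum possible total; the function names, dictionary keys and example ratings are hypothetical and are not taken from the study.

```python
from typing import Mapping, Sequence

# Assumption: each of the 4 subscales is scored 1-5, as on the Appendix 1 form.
SUBSCALES = ("empathy", "coherence", "verbal", "non_verbal")
MAX_PER_SCALE = 5


def summated_global_percent(ratings: Mapping[str, int]) -> float:
    """Sum the 4 subscale ratings and express them as % of the maximum (20)."""
    for name in SUBSCALES:
        score = ratings[name]
        if not 1 <= score <= MAX_PER_SCALE:
            raise ValueError(f"{name} rating {score} is outside 1-{MAX_PER_SCALE}")
    total = sum(ratings[name] for name in SUBSCALES)
    return 100.0 * total / (MAX_PER_SCALE * len(SUBSCALES))


def checklist_percent(items_done: Sequence[bool]) -> float:
    """Binary checklist: count items scored 'done', as % of items on the list."""
    return 100.0 * sum(items_done) / len(items_done)


# Hypothetical ratings for one candidate at one station (not study data).
example = {"empathy": 4, "coherence": 3, "verbal": 4, "non_verbal": 4}
print(summated_global_percent(example))              # 75.0
print(checklist_percent([True, False, True, True]))  # 75.0
```

Under these assumptions, a candidate's overall score would be the mean of the station-level percentages across the 10 stations.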

Stations were created by doctors following a standardised protocol.12 Ten scenarios were created, based

on real clinical cases and including detailed descriptions

of medical history, personal history, family history and

mental status. The scenarios were similar to those in the

traditional OSCE stations developed and used over

several years in examinations for clinical clerks at the

University of Toronto.12 However, all the stations were

created specifically for this study and therefore would

not have been previously encountered by the students.

Station content covered the major disciplines of medicine (2 stations), surgery (2 stations), psychiatry (2 stations), paediatrics (2 stations) and obstetrics/gynaecology (2 stations). All scenarios were set in the context of a primary care practice and involved history taking and/or physical examination of common clinical problems. Standardised patients were selected from an

established Standardised Patient Bank at the University

of Toronto, which thoroughly screens each individual

before they are accepted into the programme. For

the administration of the examination, 3 SPs were

trained for each of the 10 stations by an experienced SP

trainer.

Results

Table 1 shows the mean checklist and global scores over

10 stations for the 2 groups and the t-statistics for

between-group differences in mean scores for Year 3 and

Year 4 students. In addition, mean scores for each of the

4 subscales of the global rating form are shown, with

corresponding t-statistics and P-values. All scores are

presented as percentages of total possible marks

achieved. The data were normally distributed on all

measures. Mean global rating scores were significantly higher for senior clerks: 75.5% versus 71.3% (t(55) = 2.12, P < 0.05). There were also significant differences by level of training for the subscales assessing coherence (t(55) = 3.33, P < 0.01) and verbal communication (t(55) = 2.33, P < 0.05). Interstation reliability was 0.54 for checklists, 0.70 for global rating and ranged from 0.58 to 0.65 for global rating subscales (see Table 2).
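As a worked check, the t-statistics for the global rating and the coherence subscale can be recomputed from the means and standard deviations printed in Table 1. The sketch below assumes an equal-variance (pooled) independent-samples t-test, which is consistent with the 55 degrees of freedom reported but is not stated explicitly in the text; the function name `pooled_t` is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance independent-samples t statistic from summary data.

    Returns (t, degrees_of_freedom), with t positive when group 2 is higher.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_err, df


# Year 3 (n = 19) versus Year 4 (n = 38), means and SDs as printed in Table 1.
t_global, df = pooled_t(71.34, 7.45, 19, 75.46, 6.65, 38)
t_coherence, _ = pooled_t(63.58, 7.17, 19, 70.42, 7.37, 38)
print(df, round(t_global, 2), round(t_coherence, 2))  # 55 2.12 3.33
```

Because the inputs are rounded summary statistics, small differences from the published third decimal place are to be expected.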

Conclusions

This research utilised examiner-assessed OSCE stations

and thus cannot be generalised to OSCEs in which global

ratings are completed by SPs. Nevertheless, we feel that, because we have provided clear information about the exact type of global rating used (general impression, holistic, primary trait or analytic), other researchers will be able to draw more valid conclusions in comparisons with their own measures. Indeed, we would encourage other researchers to

fully disclose the exact nature of their global ratings when

reporting on this important line of research.

We found the analytic global ratings used in this

study to have both significant construct validity and

higher reliability than checklists. The reliability finding is consistent with a growing body of literature. The validity finding provides some new evidence that global ratings

can have sufficient validity to discriminate between

cohorts that are as close as 1 year of medical training

apart. Perhaps most importantly, however, we have

shown that some individual rating scales are sensitive to level of training while others are not. Of course, this would need to be confirmed with replication studies, particularly as our sample size was small. Our results are also limited by the manner in which we recruited subjects. It is possible that a volunteer sample is not entirely representative of its class.

Table 1 Differences in group means for checklist and global rating scores, with t-tests for independent samples

Columns 2 to 4: mean scores (% of total) for checklists, overall global ratings and subscales. Column 5: t-test to compare means for Years 3 and 4.

                            Year 3          Year 4          Pooled sample   T-statistic
                            (n = 19)        (n = 38)        (n = 57)        (P-value)
Checklist                   51.27 (3.30)    53.42 (5.65)    52.70 (5.07)    1.812 (0.076)
Global                      71.34 (7.45)    75.46 (6.65)    74.09 (7.13)    2.118 (0.039)
Empathy                     73.16 (8.36)    74.89 (7.82)    74.32 (7.97)    0.773 (0.443)
Coherence                   63.58 (7.17)    70.42 (7.37)    68.14 (7.94)    3.334 (0.002)
Verbal communication        75.79 (8.16)    80.26 (6.07)    78.77 (7.09)    2.334 (0.023)
Non-verbal communication    72.84 (7.76)    76.26 (7.33)    75.12 (7.58)    1.630 (0.109)

*Standard deviations around each mean are indicated in brackets.

However, with these limitations in mind, we should

perhaps not be surprised by the fact that scores for

coherence and questioning were higher in more senior

students than in their junior colleagues, whereas scores

for empathy and non-verbal skills were not. Experience

in clinical settings will ideally impact positively on a

medical student’s understanding of common presenting

problems, allowing them to demonstrate a more

coherent, integrated interview. The logic and sequence

of verbal questions would similarly improve with practice.

By contrast, empathy, a quality that is heavily expressed

through non-verbal means, may be a more constant

trait. In any case, these findings have further convinced

us of the need to carefully study the individual global

ratings that we use and not to assume that they can all

be combined to form some sort of holistic representation of performance. Attention to the individual reliability and various forms of validity of global ratings will

help us choose those that have the potential to assess

the most important domains of competence, and those

most amenable to educational intervention.

Contributors

Both authors were responsible for the conception,

implementation and analysis of the study as well as

preparation of the manuscript.

Acknowledgements

The authors wish to acknowledge the contribution of

Dr Glenn Regehr to the conception and analysis of the

study.

Funding

This research was funded by an educational grant from

the Stemmler Research and Development Fund of the

National Board of Medical Examiners, USA.

References

1 van der Vleuten CPM, Swanson DB. Assessment of clinical

skills with standardised patients: state of the art. Teach Learn

Med 1990;2 (2):58–76.

2 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the

psychometric properties of checklists and global rating scales

for assessing performance on an OSCE-format examination.

Acad Med 1998;73 (9):993–7.

3 Regehr G, Freeman R, Hodges B, Russell L. Assessing the

generalisability of OSCE measures across content domains.

Acad Med 1999;74:1320–2.

4 Hodges B, Regehr G, Hanson M, McNaughton N. Validation

of an objective structured clinical examination in psychiatry.

Acad Med 1998;73 (8):910–2.

5 Reznick R, Regehr G, Yee G, Rothman A, Blackmore D,

Dauphinee D. Process rating forms versus task specific

checklists in an OSCE for medical licensure. Acad Med

1998;73 (10):97–9.

6 Schmidt HG, Norman GR, Boshuizen E. A cognitive perspective on medical expertise: theory and implications. Acad Med 1990;65:611–21.

7 Dreyfus HL, Dreyfus SE. Mind over Machine. New York: Free

Press 1986.

8 Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson

M. Checklists do not capture increasing levels of expertise.

Acad Med 1999;74:1129–34.

9 Cox K. No Oscar for OSCE. Med Educ 1990;24:540–5.

10 Charlin B, Tardif J, Boshuizen PA. Scripts and medical diagnostic knowledge: theory and applications for clinical reasoning instruction and research. Acad Med 2000;75 (2):182–90.

11 Hunter DM, Jones RM, Randhawa BS. The use of holistic

versus analytic scoring for large-scale assessment of writing.

Can J Program Evaluation 1996;11 (2):61–85.

12 (Educating Future Physicians of Ontario). How to Run An

OSCE: Guide for The Beginner. Video 1995.

Received 23 March 2001; editorial comments to authors 29 October

2001; accepted for publication 13 March 2003

Table 2 Alpha (internal consistency) coefficients for checklist and global rating scores in a 10-station OSCE

                            All candidates pooled (n = 57)
Checklist                   0.54
Summated global ratings     0.70
Empathy subscale            0.60
Coherence subscale          0.58
Verbal communication        0.65
Non-verbal communication    0.62
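For readers who want to reproduce coefficients of this kind, the following is a minimal sketch of the standard Cronbach's alpha calculation, treating the stations as items. The article does not state the formula or software used, so this is an assumption; the function name `cronbach_alpha` and the toy scores are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates x stations table of scores.

    `scores` is a list of rows, one per candidate, each holding that
    candidate's score at every station (the stations act as the 'items').
    """
    k = len(scores[0])  # number of stations
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 stations and 2 candidates")
    # Sample variance of each station's scores across candidates.
    station_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of the candidates' total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(station_vars) / total_var)


# Hypothetical scores: 3 candidates x 2 stations (not data from the study).
toy = [[2, 4], [4, 4], [6, 8]]
print(round(cronbach_alpha(toy), 3))  # 0.923
```

In the study's design the input would be a 57 x 10 table, one row per candidate and one column per station, computed separately for the checklist, the summated global rating and each subscale.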

Appendix 1 Analytic global rating form used in the OSCE

Response to patient’s feelings and needs (empathy)

1 2 3 4 5

Does not respond to obvious patient cues (verbal and non-verbal) and/or responds inappropriately

Responds to patient’s needs and cues, but not always effectively

Responds consistently in a perceptive and genuine manner to the patient’s needs and cues

Degree of coherence in the interview

1 2 3 4 5

No recognisable plan to the interaction; the plan does not demonstrate cohesion or the patient must determine the direction of the interview

Organisational approach is formulaic and minimally flexible and/or control of the interview is inconsistent

Superior organisation, demonstrating command of cohesive devices, flexibility, and consistent control of the interview

Verbal expression

1 2 3 4 5

Communicates in manner that interferes with and/or prevents understanding by patient, or communicates inappropriately with the patient

Exhibits sufficient control of expression to be understood by an active, engaged listener (patient)

Exhibits command of expression (fluency, diction, grammar, vocabulary, tone, volume and modulation of voice, rate of speech, pace and pronunciation)

Non-verbal expression

1 2 3 4 5

Fails to engage, frustrates and/or antagonises the patient

Exhibits enough control of non-verbal expression to engage a patient willing to overlook deficiencies such as passivity, self-consciousness or inappropriate aggressiveness

Exhibits finesse and command of non-verbal expression (eye-contact, gesture, posture, use of silence, etc.)

Overall assessment of the knowledge and skills demonstrated in the interview

A = Incompetent: Responds inappropriately and ineffectively to the task, indicating a lack of knowledge and/or undeveloped interpersonal and interviewing skills

B = Borderline: Responds effectively to some components of the task, some development of interpersonal and interviewing skills

C = Competent: Responds precisely and perceptively to the task, consistently integrating all components

Developed in collaboration with Cleo Boyd, Communications Consultant, Faculty of Pharmacy, University of Toronto.
