Analytic global OSCE ratings are sensitive to level of training
Brian Hodges1 & Jodi Herold McIlroy2
Purpose There are several reasons for using global ratings
in addition to checklists for scoring objective
structured clinical examination (OSCE) stations.
However, there has been little evidence collected
regarding the validity of these scales. This study
assessed the construct validity of an analytic global
rating with 4 component subscales: empathy, coherence,
verbal and non-verbal expression.
Methods A total of 19 Year 3 and 38 Year 4 clinical clerks
were scored on content checklists and these global
ratings during a 10-station OSCE. T-tests were used to
assess differences between groups for overall checklist
and global scores, and for each of the 4 subscales.
Results The mean global rating was significantly higher
for senior clerks (75.5% versus 71.3%, t(55) = 2.12,
P < 0.05) and there were significant differences by
level of training for the coherence (t(55) = 3.33,
P < 0.01) and verbal communication (t(55) = 2.33,
P < 0.05) subscales. Interstation reliability was 0.70
for the global rating and ranged from 0.58 to 0.65 for
the subscales. Checklist reliability was 0.54.
Conclusion In this study, a summated analytic global
rating demonstrated construct validity, as did 2 of the 4
scales measuring specific traits. In addition, the analytic
global rating showed substantially higher internal consistency
than did the checklists, a finding consistent
with that seen in previous studies cited in the literature.
Global ratings are an important element of OSCE
measurement and can have good psychometric properties.
However, OSCE researchers should clearly
describe the type of global ratings they use. Further
research is needed to define the most effective global
rating scales.
Keywords education, medical/*standards; clinical
clerkships/*standards; *educational measurement;
reproducibility of results; psychometrics/methods.
Medical Education 2003;37:1012–1016
Introduction
The assessment of clinical competence in medical
education has been greatly advanced over the past 2
decades by the widespread use of objective structured
clinical examinations (OSCEs). The OSCE usually
combines the reality of live clinical interactions with the
standardisation of problems and the use of multiple
observers.1 Typically, an OSCE consists of a series of
independently scored stations, each of which involves a
standardised patient (SP) who portrays a clinical
scenario. Scoring at each station is performed either
by an examiner who ticks off items on a binary checklist
while observing the scenario from within the room or
through a 1-way mirror, or by the SP at the end of the
encounter. Following each encounter, the examiner or
SP may also complete 1 or more global ratings
regarding general dimensions of performance. In the
vast majority of OSCEs, binary checklists account for
the bulk of the marks awarded to candidates.
Recently, however, several authors have advanced
arguments for an increased use of global ratings over
the traditional reliance on checklists. There are several
reasons for these proposals. Firstly, global ratings
appear to have psychometric properties (including
interstation reliability or internal consistency, concurrent
validity and construct validity) that are at least as
good as, and often better than, those of checklists.1–5
Secondly, there is a growing body of literature that
suggests that clinicians with higher levels of expertise
are highly focused in gathering data and may be
penalised by checklists that reward thoroughness.6–8
Thirdly, the checklists used in most OSCEs use binary
(yes/no) ratings that tend to neglect the higher components
of clinical competence, such as empathy9 and
the organisation of knowledge, in favour of simple,
linear accumulation of facts.10 Thus there are several
reasons to utilise some form of global rating in addition
to binary content checklists.
1Wilson Centre for Research in Education, University of Toronto,
Ontario, Canada
2Department of Psychiatry and Physical Therapy, University of
Toronto, Ontario, Canada
Correspondence: Dr Brian Hodges, Director, University of Toronto
Wilson Centre for Research in Education, Toronto General Hospital,
1 Eaton/200 Elizabeth Street, Toronto, Ontario M5G 2C4, Canada.
Tel.: 00 1 416 340 3646; Fax: 00 1 416 340 3792; E-mail:
Original article
© Blackwell Publishing Ltd MEDICAL EDUCATION 2003;37:1012–1016
However, there appears to be great variation in the
literature as to what constitutes a ‘global’ rating scale,
which, to name but a few of its nominal definitions, is
sometimes described as a ‘process scale’, ‘communication
scale’ or ‘professional behaviour scale’. Not only is
there a range of domains assessed by the large variety of
non-binary scales used, there is also great variability in
their structure. Elsewhere in the educational research
literature, scoring systems have been more carefully
categorised on a continuum ranging from the most
holistic (or ‘global’) to the most analytic. Hunter and
colleagues11 have defined 5 broad categories along this
continuum, ranging from the most to the least holistic:
general impression scoring > holistic scoring >
primary trait scoring > analytic scoring > atomistic
scoring. If we adapt this classification to OSCE ratings,
an example of general impression scoring is the final
subjective pass/fail judgement that is sometimes made
by an examiner at the end of an OSCE station. Further
along the continuum, holistic scoring comprises 1 final
global judgement based on a set of predefined global
descriptors, or on a process of examiner training that is
aimed at increasing the consistency of scoring. Primary
trait scoring involves a set of more precisely defined
performance criteria, including identification of the
specific elements of an effective performance. Because
it is a holistic method, however, there is an assumption
that the defined traits are elements of 1 unified
performance construct and that ‘the whole produces
the parts’.11 Analytic scoring, in contrast, proceeds
from the assumption that it is the observable and
measurable components of performance that produce
the whole and not the reverse. Thus, the scoring system
breaks the performance into component parts that are
scored individually first and then summed to generate an
overall performance score.11 Finally, at the end of the
continuum, checklists represent ‘atomistic’ scoring
because they involve the counting of small elements
of performance in an effort to ‘equate scoring with
taking inventory’.11
In the absence of the widespread application of such
a classification system, it becomes difficult to make
comparisons between scales (which themselves are
often omitted from published articles), as authors are
often not explicit about the structure and scoring of
their global ratings. Further, while conclusions are
frequently made about the psychometric superiority of
global ratings, such statements have for the most part
referred to interstation reliability only. To date, somewhat
less attention has been paid to the validity of
OSCE global rating scales. This study examined the
construct validity of a specific set of analytic global
ratings that are included in this report. While there are
many ways to explore construct validity, in this study
we sought evidence that global rating scales, both
individually and in aggregate, could differentiate
between students at different levels of training. The
domains (traits) assessed by these behaviourally
anchored ratings were:
1 response to the patient’s feelings and needs;
2 degree of coherence in the interview;
3 verbal expression, and
4 non-verbal expression.
Method
All 350 medical students in the final 2 years of
the 4-year medical programme at the University of
Toronto, Ontario, Canada were offered an opportunity
to participate in the study. The first 60 respondents
were signed up for participation. Of these, 21 were in
Year 3 and 39 were in Year 4 of the programme,
representing 12% and 22% of their classes, respectively.
Complete data were gathered on 57 subjects, 19 in
Year 3 and 38 in Year 4 (3 subjects did not take the
examination due to illness). All subjects were informed
of the voluntary nature of the examination and told that
their scores would be kept confidential. They were paid
a CAN$50.00 honorarium for their participation.
Key learning points
Checklists currently used in OSCEs are not always
sensitive to higher levels of expertise.
Analytic global ratings can be used to augment
OSCE checklists and can have good psychometric
properties, including construct validity.
The term ‘global rating’ can mean many things and
researchers should be careful to define the type of
ratings used in their research and provide
examples.
Each individual participated in 10 10-minute OSCE
stations. Doctor examiners recruited from the University
of Toronto evaluated candidate performance at all
stations. One examiner rated student performance in
each station, using 2 forms. Firstly, a task-specific
binary checklist that was unique to each station was
scored as ‘done’ or ‘not done/done incorrectly’ and
summed for a total score. Secondly, a series of 4 5-point
global ratings was completed to rate the candidate’s
performance in the 4 domains described above:
empathy, coherence, verbal communication and non-verbal
expression (see Appendix 1 for the global rating
form). Scores on the 4 individual global scales were
summed to create a ‘summated global rating’. Examiners
were blinded as to whether the students they were
observing were in Years 3 or 4.
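The summation step described above can be illustrated with a short sketch. It is not taken from the study materials: the dictionary keys, the function names and the conversion to a percentage by dividing by the maximum possible 20 points are assumptions made for illustration (the paper states only that scores are presented as percentages of total possible marks).

```python
from statistics import mean

# Hypothetical keys for the 4 subscales on the global rating form.
SUBSCALES = ("empathy", "coherence", "verbal", "non_verbal")
MAX_POINTS = 5  # each subscale is rated on a 5-point scale (1 to 5)


def summated_global_rating(station_ratings):
    """Sum the 4 subscale ratings for one station and express the
    total as a percentage of the maximum possible marks (assumed)."""
    for name in SUBSCALES:
        if not 1 <= station_ratings[name] <= MAX_POINTS:
            raise ValueError(f"{name} rating must be between 1 and {MAX_POINTS}")
    total = sum(station_ratings[name] for name in SUBSCALES)
    return 100.0 * total / (MAX_POINTS * len(SUBSCALES))


def candidate_global_score(all_stations):
    """Average the station-level summated ratings across an OSCE."""
    return mean(summated_global_rating(s) for s in all_stations)


# Made-up ratings for a single station: 4 + 3 + 4 + 4 = 15 of 20 points.
example = {"empathy": 4, "coherence": 3, "verbal": 4, "non_verbal": 4}
print(summated_global_rating(example))  # 75.0
```

Because each subscale is scored separately before summing, the same data also yield the 4 subscale scores that are analysed individually in the Results.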
Stations were created by doctors following a standardised
protocol.12 Ten scenarios were created, based
on real clinical cases and including detailed descriptions
of medical history, personal history, family history and
mental status. The scenarios were similar to those in the
traditional OSCE stations developed and used over
several years in examinations for clinical clerks at the
University of Toronto.12 However, all the stations were
created specifically for this study and therefore would
not have been previously encountered by the students.
Station content covered the major disciplines of medicine
(2 stations), surgery (2 stations), psychiatry (2 stations),
paediatrics (2 stations) and obstetrics/gynaecology
(2 stations). All scenarios were set in the context of a
primary care practice and involved history taking
and/or physical examination of common clinical problems.
Standardised patients were selected from an
established Standardised Patient Bank at the University
of Toronto, which thoroughly screens each individual
before they are accepted into the programme. For
the administration of the examination, 3 SPs were
trained for each of the 10 stations by an experienced SP
trainer.
Results
Table 1 shows the mean checklist and global scores over
10 stations for the 2 groups and the t-statistics for
between-group differences in mean scores for Year 3 and
Year 4 students. In addition, mean scores for each of the
4 subscales of the global rating form are shown, with
corresponding t-statistics and P-values. All scores are
presented as percentages of total possible marks
achieved. The data were normally distributed on all
measures. Mean global rating scores were significantly
higher for senior clerks: 75.5% versus 71.3% (t(55) = 2.12,
P < 0.05). There were also significant differences by
level of training for the subscales assessing coherence
(t(55) = 3.33, P < 0.01) and verbal communication
(t(55) = 2.33, P < 0.05). Interstation reliability was 0.54
for checklists, 0.70 for the global rating and ranged from 0.58
to 0.65 for global rating subscales (see Table 2).
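The between-group comparison can be checked from the summary statistics in Table 1. The sketch below assumes an equal-variance (pooled) independent-samples t-test, which is consistent with the reported 55 degrees of freedom (19 + 38 - 2) but is not stated explicitly in the paper; the function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from group means, standard
    deviations and sizes, assuming equal variances (pooled)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_error, df


# Summated global rating, Year 3 versus Year 4 (values from Table 1).
t_stat, df = pooled_t(71.34, 7.45, 19, 75.46, 6.65, 38)
print(f"t({df}) = {t_stat:.3f}")  # t(55) = 2.118
```

The result agrees with the t-statistic reported for the global rating. Small discrepancies can be expected for other rows because the tabulated means and standard deviations are rounded.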
Conclusions
This research utilised examiner-assessed OSCE stations
and thus cannot be generalised to OSCEs in which global
ratings are completed by SPs. Nevertheless, we feel that,
because we have provided clear information about the exact
type of global rating (general impression, holistic, primary
trait or analytic) used, other researchers will be able to draw
more valid conclusions when making comparisons with their own
measures. Indeed, we would encourage other researchers to
fully disclose the exact nature of their global ratings when
reporting on this important line of research.
We found the analytic global ratings used in this
study to have both higher reliability than checklists and
significant construct validity. The first finding is
consistent with a growing body of literature. The
second provides some new evidence that global ratings
can have sufficient validity to discriminate between
cohorts that are as close as 1 year of medical training
apart. Perhaps most importantly, however, we have
shown that some individual rating scales are sensitive to
level of training while others are not. Of course, this
would need to be confirmed with replication studies,
particularly as our sample size was small. Our results
are also limited by the manner in which we recruited
subjects. It is possible that a volunteer sample is not
entirely representative of its class.
Table 1 Differences in group means for checklist and global rating scores, with t-tests for independent samples
Mean scores (% of total) for checklists, overall global ratings and subscales; t-test to compare means for Years 3 and 4
Measure                    Year 3 (n = 19)  Year 4 (n = 38)  Pooled sample (n = 57)  T-statistic (P-value)
Checklist                  51.27 (3.30)     53.42 (5.65)     52.70 (5.07)            1.812 (0.076)
Global                     71.34 (7.45)     75.46 (6.65)     74.09 (7.13)            2.118 (0.039)
Empathy                    73.16 (8.36)     74.89 (7.82)     74.32 (7.97)            0.773 (0.443)
Coherence                  63.58 (7.17)     70.42 (7.37)     68.14 (7.94)            3.334 (0.002)
Verbal communication       75.79 (8.16)     80.26 (6.07)     78.77 (7.09)            2.334 (0.023)
Non-verbal communication   72.84 (7.76)     76.26 (7.33)     75.12 (7.58)            1.630 (0.109)
*Standard deviations around each mean are indicated in brackets.
However, with these limitations in mind, we should
perhaps not be surprised by the fact that scores for
coherence and verbal communication were higher in more senior
students than in their junior colleagues, whereas scores
for empathy and non-verbal skills were not. Experience
in clinical settings will ideally impact positively on a
medical student’s understanding of common presenting
problems, allowing them to demonstrate a more
coherent, integrated interview. The logic and sequence
of verbal questions would similarly improve with practice.
By contrast, empathy, a quality that is heavily expressed
through non-verbal means, may be a more constant
trait. In any case, these findings have further convinced
us of the need to carefully study the individual global
ratings that we use and not to assume that they can all
be combined to form some sort of holistic representation
of performance. Attention to the individual reliability
and various forms of validity of global ratings will
help us choose those that have the potential to assess
the most important domains of competence, and those
most amenable to educational intervention.
Contributors
Both authors were responsible for the conception,
implementation and analysis of the study as well as
preparation of the manuscript.
Acknowledgements
The authors wish to acknowledge the contribution of
Dr Glenn Regehr to the conception and analysis of the
study.
Funding
This research was funded by an educational grant from
the Stemmler Research and Development Fund of the
National Board of Medical Examiners, USA.
References
1 van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardised patients: state of the art. Teach Learn Med 1990;2(2):58–76.
2 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73(9):993–7.
3 Regehr G, Freeman R, Hodges B, Russell L. Assessing the generalisability of OSCE measures across content domains. Acad Med 1999;74:1320–2.
4 Hodges B, Regehr G, Hanson M, McNaughton N. Validation of an objective structured clinical examination in psychiatry. Acad Med 1998;73(8):910–2.
5 Reznick R, Regehr G, Yee G, Rothman A, Blackmore D, Dauphinee D. Process rating forms versus task specific checklists in an OSCE for medical licensure. Acad Med 1998;73(10):97–9.
6 Schmidt HG, Norman GR, Boshuizen E. A cognitive perspective on medical expertise: theory and implications. Acad Med 1990;65:611–21.
7 Dreyfus HL, Dreyfus SE. Mind over Machine. New York: Free Press 1986.
8 Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. Checklists do not capture increasing levels of expertise. Acad Med 1999;74:1129–34.
9 Cox K. No Oscar for OSCE. Med Educ 1990;24:540–5.
10 Charlin B, Tardif J, Boshuizen PA. Scripts and medical diagnostic knowledge: theory and applications for clinical reasoning instruction and research. Acad Med 2000;72(2):182–90.
11 Hunter DM, Jones RM, Randhawa BS. The use of holistic versus analytic scoring for large-scale assessment of writing. Can J Program Evaluation 1996;11(2):61–85.
12 (Educating Future Physicians of Ontario). How to Run An OSCE: Guide for The Beginner. Video 1995.
Received 23 March 2001; editorial comments to authors 29 October
2001; accepted for publication 13 March 2003
Table 2 Alpha (internal consistency) coefficients for checklist
and global rating scores in a 10-station OSCE
Measure                    All candidates pooled (n = 57)
Checklist                  0.54
Summated global ratings    0.70
Empathy subscale           0.60
Coherence subscale         0.58
Verbal communication       0.65
Non-verbal communication   0.62
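For readers unfamiliar with alpha coefficients such as those in Table 2, the sketch below shows the standard Cronbach's alpha formula applied to a candidates-by-stations score matrix. The paper does not describe the software or exact procedure used, so the function name, the data layout and the toy data are assumptions for illustration only.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, one row per candidate,
    each row holding one score per station (item)."""
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least 2 candidates and 2 stations")
    # Sample variance of each station's scores across candidates.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each candidate's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up scores for 3 candidates on 2 stations (not study data).
toy_scores = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy_scores), 3))  # 0.667
```

In the study design, each row would hold a candidate's 10 station scores on one measure (checklist, summated global rating or a single subscale), giving one alpha per measure.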
Appendix 1 Analytic global rating form used in the OSCE
Response to patient’s feelings and needs (empathy)
1 2 3 4 5
1: Does not respond to obvious patient cues (verbal and non-verbal) and/or responds inappropriately
3: Responds to patient’s needs and cues, but not always effectively
5: Responds consistently in a perceptive and genuine manner to the patient’s needs and cues
Degree of coherence in the interview
1 2 3 4 5
1: No recognisable plan to the interaction; the plan does not demonstrate cohesion or the patient must determine the direction of the interview
3: Organisational approach is formulaic and minimally flexible and/or control of the interview is inconsistent
5: Superior organisation, demonstrating command of cohesive devices, flexibility, and consistent control of the interview
Verbal expression
1 2 3 4 5
1: Communicates in manner that interferes with and/or prevents understanding by patient, or communicates inappropriately with the patient
3: Exhibits sufficient control of expression to be understood by an active, engaged listener (patient)
5: Exhibits command of expression (fluency, diction, grammar, vocabulary, tone, volume and modulation of voice, rate of speech, pace and pronunciation)
Non-verbal expression
1 2 3 4 5
1: Fails to engage, frustrates and/or antagonises the patient
3: Exhibits enough control of non-verbal expression to engage a patient willing to overlook deficiencies such as passivity, self-consciousness or inappropriate aggressiveness
5: Exhibits finesse and command of non-verbal expression (eye-contact, gesture, posture, use of silence, etc.)
Overall assessment of the knowledge and skills demonstrated in the interview
A = Incompetent: Responds inappropriately and ineffectively to the task, indicating a lack of knowledge and/or undeveloped interpersonal and interviewing skills
B = Borderline: Responds effectively to some components of the task, some development of interpersonal and interviewing skills
C = Competent: Responds precisely and perceptively to the task, consistently integrating all components
Developed in collaboration with Cleo Boyd, Communications Consultant, Faculty of Pharmacy, University of Toronto.