2013; 35: 720–726

Does the think-aloud protocol reflect thinking? Exploring functional neuroimaging differences with thinking (answering multiple choice questions) versus thinking aloud

STEVEN J. DURNING¹, ANTHONY R. ARTINO JR.¹, THOMAS J. BECKMAN², JOHN GRANER¹, CEES VAN DER VLEUTEN³, ERIC HOLMBOE⁴ & LAMBERT SCHUWIRTH⁵

¹Uniformed Services University of the Health Sciences, USA, ²Mayo Clinic, USA, ³Maastricht University, the Netherlands, ⁴American Board of Internal Medicine, USA, ⁵Flinders University, Australia

Abstract

Background: Whether the think-aloud protocol is a valid measure of thinking remains uncertain. Therefore, we used functional magnetic resonance imaging (fMRI) to investigate potential functional neuroanatomic differences between thinking (answering multiple-choice questions in real time) versus thinking aloud (on review of items).

Methods: Board-certified internal medicine physicians underwent formal think-aloud training. Next, they answered validated multiple-choice questions in an fMRI scanner while both answering (thinking) and thinking aloud about the questions, and we compared fMRI images obtained during both periods.

Results: Seventeen physicians (15 men and 2 women) participated in the study. Mean physician age was 39.5 ± 7 (range: 32–51 years). The mean number of correct responses was 18.5/32 questions (range: 15–25). Statistically significant differences were found between answering (thinking) and thinking aloud in the following regions: motor cortex, bilateral prefrontal cortex, bilateral cerebellum, and the basal ganglia (p < 0.01).

Discussion: We identified significant differences between answering and thinking aloud within the motor cortex, prefrontal cortex, cerebellum, and basal ganglia. These differences were by degree (more focal activation in these areas with thinking aloud as opposed to answering). Prefrontal cortex and cerebellum activity was attributable to working memory. Basal ganglia activity was attributed to the reward of answering a question. The identified neuroimaging differences between answering and thinking aloud were expected based on existing theory and research in other fields. These findings add evidence to the notion that the think-aloud protocol is a reasonable measure of thinking.

Background

Clinical reasoning lies at the heart of any successful clinical practice. Although there have been numerous definitions of clinical reasoning, they all converge on the idea that clinical reasoning entails cognitive operations that allow physicians to observe, interpret and analyze information, and determine diagnoses and further management (Higgs & Jones 2008).

Studying the phenomenon of clinical reasoning is no simple matter, mainly because cognitive operations cannot be observed directly and therefore must be inferred from observable behavior. A major challenge with conducting research on cognition pertains to the potentially confounding factors that exist in the sequence from cognitive processes (i.e. intermediate steps to the diagnosis or therapy) to the eventual observable behaviors (i.e. the final diagnosis or therapy for a patient). According to Ericsson and Simon (1987), ‘It is important to note that any observable behavior used as data for a thought process requires an explicit account of its relation to the states of the thought processes and any mediating additional cognitive processes’.

Practice points

• Assessing clinical reasoning is challenging as cognitive operations cannot be directly observed.
• A commonly used method for studying cognition is the think-aloud protocol.
• Functional MRI introduces the possibility of examining the relationship between think-aloud data and brain activation patterns.
• Our exploratory findings provide some neuroimaging evidence that the think-aloud protocol can help educators with assessing cognition.

Correspondence: Steven J. Durning, MD, PhD, FACP, Professor of Medicine and Pathology, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda MD 20814-4799, USA. Tel: +1 301-295-3603; fax: +1 301-295-5792; email: [email protected]

ISSN 0142–159X print/ISSN 1466–187X online/13/90720–7 © 2013 Informa UK Ltd.
DOI: 10.3109/0142159X.2013.801938

One commonly used method for studying cognition is the think-aloud protocol. Currently, think-aloud protocol methodology, used either during the task or retrospectively after the task, is seen as an optimal methodology to capture thought processes (Ericsson 2006). Think-aloud protocols have been used to study clinical reasoning. Nonetheless, there has been an extensive debate about the validity of think-aloud methodology. Emerging methods, such as functional magnetic resonance imaging (fMRI), may help resolve the ongoing debate.

There are two general categories of confounding factors during think-aloud protocols: reactiveness and nonveridicality (Russo et al. 1989). Reactiveness pertains to the effect that verbalization has on the underlying cognitive processes. According to information-processing theory (Newell & Simon 1972), recently acquired information is stored and processed in the short-term memory (STM) as a series of consecutive ‘steps’, which eventually lead to the outcome of the problem-solving process. Asking someone to verbalize his/her thoughts places additional requirements on the STM. In the simplest case, this may prolong the response time because the act of verbalization adds to the time required for the ‘internal voice’ or thinking (Ericsson et al. 2006). In more complex cases, verbalization could interrupt the ‘internal voice’ in the STM, cause loss of information from the STM, and fundamentally change the cognitive processes. Despite this theoretical concern and lack of empirical evidence, many believe that ‘the verbal protocol procedure slows down the process slightly but does not change it fundamentally’ (Russo et al. 1989).

The second source of confounding is nonveridicality, which is based on differences between the information that is processed during unprompted versus think-aloud situations. Errors of omission and commission are relevant to nonveridicality and are analogous to Messick’s construct underrepresentation and construct-irrelevant variance (Messick 1989). Thus far, reactiveness and nonveridicality have been studied mainly by comparing different types of think-aloud procedures or by comparing the performances with think-aloud procedures that use instruction versus no instruction (Russo et al. 1989). Notably, all these studies have investigated various aspects of translating cognitive processes to observable behaviors (Ericsson et al. 2006). However, we could not identify previous research that uses functional neuroimaging and think-aloud protocols to study clinical reasoning.

The recent emergence of fMRI introduces the possibility of examining the relationship between think-aloud data and visualization of the artifacts of thought processes within specific brain regions. Previous work (Rosen et al. 2000) has compared neuroimaging differences that occur when thinking aloud versus vocalizing silently with a stem completion task (three-letter words completed into a normal English word). This study found differences between these activities for both word retrieval and comparison within the primary motor cortex, frontal operculum, and the dorsolateral cortex; these differences were attributable to the motor activation associated with thinking aloud (speaking) versus silent vocalization. Also identified were asymmetric left thalamus and putamen activations with speaking versus thinking. Overall, the authors concluded that fMRI findings associated with speaking versus thinking differ only with respect to the motor areas, although the differential activations of the putamen and thalamus cannot be explained from motor activity alone. In another study, Klasen et al. (2008) used fMRI to identify hypothesized brain activations that occur with emotional states and verbalized feelings among subjects playing computer games. They found good matches between the fMRI data and behavioral and experiential content, and concluded that their results added to the validity of verbal protocols. Based on all these findings, it would seem axiomatic that anticipated functional neuroimaging changes occur during clinical reasoning, though we could find no published research regarding the use of fMRI and think-aloud protocols for studying clinical reasoning among physicians.

Many diagnostic and treatment mistakes are due to preventable cognitive errors (Brennan et al. 1991; Croskerry 2003). Indeed, physician certification and recertification attempt to measure cognitive ability through the use of multiple-choice questions (MCQs) that pose diagnostic or treatment questions. These questions are the ‘gold standard’ for assessing physician competence because they have undergone rigorous psychometric analysis and have reliability and validity evidence that is based on thousands of test-takers. We sought to determine if the functional neuroimaging of thinking (answering) differed from thinking aloud about validated MCQs that assess the construct of clinical reasoning. The identification of minimal differences could provide physiological, criterion-related validity evidence for using the think-aloud protocol in educational research.

Methods

Participants

Participants were board-certified internal medicine attending physicians with faculty appointments at the Uniformed Services University of the Health Sciences (USU). Exclusion criteria were the presence of shrapnel or surgical metal devices, inability to complete an fMRI due to anxiety or claustrophobia, taking calcium channel blockers, which can impact regional blood flow, or pregnancy. The study was approved by the Institutional Review Boards of the Uniformed Services University and the Walter Reed Army Medical Center, and participants provided informed consent.

Measurements

Multiple-Choice Questions (MCQs). We used validated MCQs from the American Board of Internal Medicine (ABIM), which is the organization responsible for the certification and recertification of internal medicine physicians in the United States, and MCQs from the National Board of Medical Examiners (NBME), to assess physician expertise. The specific NBME items chosen were those on the United States Medical Licensing Exam (USMLE) Step 2 Clinical Knowledge (CK) Examination. All MCQs from both the ABIM and NBME are validated on thousands of subjects and are then modified, as needed, to optimize their psychometric performance (Melnick 2009). Given the small number of items answered in the scanner, we selected questions from two fields in medicine: cardiology and rheumatology. The MCQs selected required the integration and synthesis of data. Participants answered a total of 32 questions: 16 NBME items (United States Medical Licensing Exam Step 2 Clinical Knowledge items) and 16 ABIM items (Maintenance of Certification [MOC] MCQ) that are currently being used to recertify physicians.

We selected questions that fit on a single screen and contained only words (i.e. no images), given limitations on screen size and resolution. The selected items also had favorable discrimination, such that high-performing individuals answered the hard questions correctly and only the lowest performing individuals missed the easy items. In addition, the MCQ format (participants pushed handheld buttons for answer options ‘A’ to ‘E’) made the items well suited to the fMRI scanner because it eliminated the need for participants to speak; jaw motion impairs fMRI image interpretation. Because participants were asked to remain completely still in the fMRI scanner, formally assessing for evidence of subvocalization was not possible. The questions selected focused on diagnostic reasoning; they all posed a vignette and then asked participants ‘What is the most likely diagnosis’ or a related diagnostic question.

fMRI process

Prior to entering the fMRI scanner, participants were formally trained in the think-aloud procedure using standard guidelines (Ericsson & Simon 1984). Thinking aloud involves stating whatever comes to mind (all the information that passes through a subject’s mind) as one works through a task (Ericsson et al. 2006). Following this training and completion of a pre-fMRI questionnaire, participants entered the fMRI scanner.

The details of the fMRI data acquisition and assessment methodology are described in Appendix 1. Subjects were scanned on a 3T Signa MRI scanner (General Electric, Milwaukee, WI) with a 32-channel head coil. The 32 questions were presented to participants in a random order.

Each MCQ was projected to participants in the fMRI scanner in three phases. In the first phase, the stem (question) appeared (‘reading’ phase). Each question stem ended with ‘what is the most likely diagnosis?’ or a related diagnostic question. The participants were given 60 s to read the stem and push any button to move on to the answer options (the second or ‘answering’ phase). Subjects were then given 7 s to choose an answer option using the finger buttons. Once the answer options were presented, the subjects could not return to the previous (‘question’) screen. After the answering phase, participants entered the final (‘think-aloud’) phase. Here, participants were instructed to ‘think aloud’, without speaking (jaw motion would compromise functional neuroanatomic images), about how they arrived at their chosen answer (‘how did you establish the diagnosis for this item’; modified think-aloud). The ‘think aloud’ phase lasted 14 s.
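As a rough check on scan time, the three phase limits described above can be totalled per item and per session. The following is a minimal Python sketch and not part of the study's materials: the names `PHASE_LIMITS_S`, `max_trial_seconds`, and `max_session_minutes` are hypothetical, and it assumes no gaps between phases or items. Because the reading phase was self-paced (up to 60 s), the result is an upper bound rather than the actual session length.

```python
# Phase time limits in seconds, taken from the protocol description.
# The reading phase is self-paced, so 60 s is a ceiling, not a fixed duration.
PHASE_LIMITS_S = {"reading": 60, "answering": 7, "think_aloud": 14}
N_QUESTIONS = 32  # 16 NBME items + 16 ABIM items


def max_trial_seconds(phase_limits):
    """Longest possible duration of one item, summing all phase limits."""
    return sum(phase_limits.values())


def max_session_minutes(phase_limits, n_questions):
    """Upper bound on task time for the session, assuming no gaps between items."""
    return n_questions * max_trial_seconds(phase_limits) / 60


print(max_trial_seconds(PHASE_LIMITS_S))                 # 81
print(max_session_minutes(PHASE_LIMITS_S, N_QUESTIONS))  # 43.2
```

Under these assumptions an item takes at most 81 s, so the 32 items take at most about 43 minutes of task time.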

There was no formal resting period between questions. Instead, the ‘reading’ phase was used as the baseline for making comparisons with the answering and thinking aloud phases, which were the focus of the current study. We believed that the reading phase would represent a better baseline for comparison than an unguided ‘rest’ period, yielding more meaningful, task-specific findings. Using the reading phase as the baseline was also an attempt to control for reading, because we were interested in answering and in thinking aloud about one’s answers. We compared ‘answering phase’ images with ‘think-aloud phase’ images to ascertain whether functional neuroimaging differences existed between answering and thinking aloud on these ‘gold standard’ items used for licensing physicians in the United States.

The participants also underwent a formal think-aloud protocol immediately after the imaging session. In this session, participants’ utterances about the same items presented in the scanner were captured. This formal think-aloud session was done to help validate the pre-fMRI think-aloud training and the thought processes elicited during the think-aloud phase in the scanner.

Data analysis

All fMRI data were processed using the AFNI software package (Cox 1996). Neuroimaging activation analysis was performed using a general linear model (GLM) approach concatenating the four data sets for each subject. Hemodynamic response estimates were modeled for the answering and thinking aloud question phases. As a first level of analysis, contrasts were made between the ‘answer’ and ‘think-aloud’ significance estimates in each voxel for each subject. A group analysis comparing this contrast across all 17 subjects was also performed using a linear mixed-effects modeling approach.

For determining functional neuroimaging differences with thinking (answering) and thinking aloud, we compared the answer phase with the think-aloud phase. This was done to capture the hemodynamic differences that might be seen when participants decided on an answer (answering phase) versus when they thought about their answer (think-aloud phase).
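To make the contrast step concrete, the sketch below computes, for each voxel, the per-subject difference between an ‘answer’ estimate and a ‘think-aloud’ estimate and then a one-sample t statistic across subjects. This is a simplified stand-in written in plain Python, not the AFNI linear mixed-effects analysis used in the study; the function name `contrast_t_map` and the synthetic numbers are hypothetical and purely illustrative.

```python
import math
import random
import statistics


def contrast_t_map(answer_betas, think_aloud_betas):
    """Return one t statistic per voxel for the answer-minus-think-aloud contrast.

    Each argument is a list of per-subject lists (subjects x voxels).
    A one-sample t-test on the within-subject differences is used here as a
    simplification; it is NOT the mixed-effects model described in the text.
    """
    n_subjects = len(answer_betas)
    n_voxels = len(answer_betas[0])
    t_map = []
    for v in range(n_voxels):
        # Within-subject contrast at this voxel.
        diffs = [answer_betas[s][v] - think_aloud_betas[s][v]
                 for s in range(n_subjects)]
        mean_diff = statistics.mean(diffs)
        std_err = statistics.stdev(diffs) / math.sqrt(n_subjects)
        t_map.append(mean_diff / std_err)
    return t_map


# Synthetic, hypothetical data: 15 subjects, 4 voxels, noise only.
rng = random.Random(0)
n_subjects, n_voxels = 15, 4
answer = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
          for _ in range(n_subjects)]
think_aloud = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
               for _ in range(n_subjects)]
for row in answer:
    row[0] += 5.0  # inject an artificial effect into voxel 0 only

t_map = contrast_t_map(answer, think_aloud)
print([round(t, 2) for t in t_map])
```

Voxel 0, where the artificial effect was injected, should stand out with a much larger t value than the others. In a real analysis the resulting map would still need correction for multiple comparisons across voxels.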

Results

Seventeen physicians (15 men and 2 women) participated in the study. Mean physician age was 39.5 ± 7 (range: 32–51 years). Two participants’ fMRI data sets were dropped due to excessive motion in the scanner and thus the final cohort was 15 physicians. The mean number of correct responses was 18.5/32 questions (range 15–25). Participant specialties were: 9 general internal medicine, 3 cardiology, 2 pulmonary/critical care, 1 gastroenterology, 1 infectious diseases, and 1 hematology-oncology. All participants were board certified in internal medicine, and all but one in their respective subspecialty (a cardiology fellow).

The functional neuroimaging analyses were exploratory and examined all areas of the brain; differences between the answering (thinking) and thinking aloud phases are presented in Table 1 and Figure 1. Statistically significant differences were found between answering (thinking) and thinking aloud in the following brain regions: motor cortex, bilateral prefrontal cortex, bilateral cerebellum, and the basal ganglia (p < 0.01). These changes were also seen when conducting the analysis separately for easy and hard items.

There was more activation with answering compared to think-aloud for the following areas: motor cortex (which was expected as participants pushed the buttons in the answering phase but not in the think-aloud phase), bilateral prefrontal cortex, bilateral cerebellum, and the caudate nucleus and putamen. Brain activity was greater with thinking aloud versus answering for the following areas (all p < 0.01): occipital cortex and precuneus (PCC).

Formal think-aloud data obtained following exit from the scanner indicated that subjects were properly trained in the think-aloud procedure, as shown by the rich descriptions they provided and the spontaneity of their responses. Respondents did not state that they forgot items and stated that they were able to focus on the task in the fMRI scanner.

Discussion

To our knowledge, this is the first study to demonstrate differences in functional neuroimaging between answering versus thinking aloud about validated MCQs among board-certified physicians. Specifically, we identified differences between answering versus thinking aloud that pertained to the prefrontal cortex, basal ganglia (caudate and putamen), occipital cortex, and cerebellum. The differences between answering and thinking aloud occurred within the same, but not between different, brain regions.

These findings provide new insights into memory retrieval and clinical reasoning among physicians, along with preliminary and novel physiological, criterion validity evidence for using the think-aloud protocol in educational research. What follows is a detailed description of how the neuroimaging findings differed for answering versus thinking aloud and how our findings build upon previous neuroimaging research.

Prefrontal cortex activation

We identified statistically significant increases in the prefrontal cortex and the basal ganglia for answering questions versus thinking aloud. The prefrontal cortex is widely believed to be involved in executive functions, such as decision making (Stuss & Knight 2002), which is consistent with our finding of prefrontal activation with silent thinking. The prefrontal cortex has also been implicated as a potential area for working memory processing (Baddeley 1986, 1992; Sammer et al. 2006). When the question is reviewed a second time during thinking aloud, it is plausible that fewer working memory resources would be activated, resulting in a greater signal within this area with answering versus thinking aloud (i.e. a more focal signal in the area was activated with thinking aloud). Our research expands upon the study by Rosen et al. (2000), which did not identify changes in the prefrontal cortex with answering versus thinking aloud. Our findings may differ from those of Rosen et al. because their study involved the simple task of retrieving and comparing words, whereas our study involved a much more intricate task of solving complex clinical problems through the use of carefully validated MCQs.

Caudate and precuneus

The caudate and putamen are structures in the basal ganglia that have recently been implicated for their roles in learning and memory. We found increased activation in these areas with answering versus thinking aloud on items, which is consistent with prior research (Graybiel 2005). Similarly, recent fMRI studies (Cavanna & Trimble 2006) implicate the precuneus in both self-referential goal-directed actions and memory retrieval. Additionally, the striatal regions are rich in dopaminergic neurons, which have been shown to be central to reward mechanisms (Koepp et al. 1988). The increased activation in these regions suggests that the act of answering the question may produce a reward signal, which is likely in this group of highly motivated individuals accustomed to being tested. Likewise, recent fMRI work has shown activation of the medial prefrontal cortex and the precuneus with emotionally salient (or ‘hot’) reasoning (Shaefer & Rotte 2007), which is consistent with our study of expert physicians answering examination questions used for licensure in the fMRI scanner. The precuneus has also been implicated as an area of working memory (Sammer et al. 2006).

Table 1. Statistically significant contrasts (comparisons) with functional neuroimaging*.

Comparison: Answer vs. think-aloud

Location            MNI coordinates**         Corrected p value
                    X       Y       Z
occipital cortex    −6      71      6         <0.01
L basal ganglia     18      −14     6         <0.01
R basal ganglia     −14     −6      2         <0.01
R PFC               40      −26     22        <0.01
L PFC               −52     −8      28        <0.01
L cerebellum        19      52      21        <0.01
R cerebellum        −26     62      −35       <0.01
Precuneus           2       48      32        <0.01

*PFC, prefrontal cortex.
**MNI, Montreal Neurological Institute; brain coordinates that differed between answering and reflecting (by X, Y, Z spatial brain dimensions).

Occipital cortex

Another significant finding is that the occipital cortex was more activated during thinking aloud than while answering items. One potential explanation is that our physician experts were visualizing actual clinical scenarios while reflecting on their answers. Another potential explanation is that different background colors for items were used as a cue to participants that they were in the think-aloud period of each MCQ. Different colors have been shown to trigger various changes in brain activation (Beauchamp 1999). Also, the finding of greater mid occipital gyrus activation while answering versus thinking aloud may be due to its potential role in working memory (Baddeley 1986, 1992). We would anticipate that repetitive, immediate processing of how one arrived at an answer on a question one has just answered (thinking aloud after answering an item) would result in less excitation of working memory (because pathways have just been established) than with the initial answer. This hypothesis, however, should be tested in future research as it is also possible that activities in the STM are different between thinking aloud and actual thoughts, which would have negative consequences for both the veridicality and reactiveness of this method.

Cerebellum

Our finding of increased cerebellum activation with answering items versus thinking aloud is also consistent with emerging work implicating the cerebellum as a working memory area (Desmond 2003). Similar to other fields (Baddeley 1986; Anderson 2010), and as noted above, we found several areas that may be involved in working memory with clinical reasoning tasks.

We would expect that the act of answering (versus thinking aloud) would involve more working memory resources (prefrontal cortex, cerebellum) and also increased activation of the striatal regions that are involved in both answering and reward; it is also plausible that the increased activation of the precuneus is due to working memory (Sammer et al. 2006). More specifically, we would expect working memory resources to be less activated upon thinking aloud on the same item, immediately after answering the item, as learning could occur between seeing the item the first and the second time. It is also possible that these changes reflect differences in veridicality of the method, though we believe that our hypotheses are consistent with current theory and thus provide some evidence, which should, however, be confirmed in larger studies. Our expectation is consistent with emerging theory on complex cognition: many cognitive functions, such as working memory, can be performed by more than one area (Just & Varma 2007).

The occipital differences are most consistent with changes in the screen color, and motor region differences are attributable to the participants physically pushing buttons for their A–E answers. Indeed, motor differences were expected and this finding supports the validity of our approach. Furthermore, participants underwent a formal think-aloud procedure immediately after leaving the fMRI scanner, and the time for thinking aloud as well as the quality of their spontaneous utterances (remembering the items and the quality of elicited thought processes) provide some additional evidence to support the adequacy of think-aloud training and our modified think-aloud approach.

The observed functional neuroimaging differences are

consistent with current understanding of clinical reasoning

theory and provide some evidence to support the use of think-

aloud protocols for measuring cognitive processes. The

presumed bias associated with ‘thinking about one’s thinking’

for purposes of these protocols may be inaccurate (Ericsson &

Simon 1984; van der Vleuten & Newble 1995; Ericsson et al.

Figure 1. Regional differences with answering versus

thinking aloud: panel A: motor cortex, B: bilateral prefrontal

cortex, C: bilateral cerebellum, and the D; basal ganglia

showed more activation while answering (red) the multiple-

choice questions than while thinking aloud upon arriving at a

chosen answer. Regions of the occipital cortex and precuneus

showed greater activation (blue) while thinking aloud than

while answering.

S. J. Durning et al.

724

Med

Tea

ch D

ownl

oade

d fr

om in

form

ahea

lthca

re.c

om b

y M

onas

h U

nive

rsity

on

09/0

6/13

For

pers

onal

use

onl

y.

2006). Our findings would support the practice of adding

think-aloud protocols to tools that measure behaviors, such as

post-encounter forms in Objective Structured Clinical Exams or

answers to MCQs. Our preliminary findings would also

support the continued use of think-aloud protocols alone for

exploring clinical reasoning. This is encouraging, as

think-aloud protocols are feasible and relatively inexpensive,

especially when compared with methods such as fMRI.

This study has several important limitations. First, it may

have had insufficient power to detect fMRI differences in other

brain regions. The sample size, however, would generally be

considered ‘acceptable to robust’ in the fMRI literature.

Alternatively, we may not have detected differences in other

brain regions due to the complexity of the assessed construct.

For example, brain imaging research investigating creativity

(and arguably, at least in some cases, diagnosing and treating a

patient is a creative event) has found that it is a challenging

phenomenon to map, likely because creative performance is

difficult to operationalize in the experimental setting (Fink

et al. 2006). Likewise, we know surprisingly little about the

brain processes that generate insights in problem solving (Luo

2006). Also, recent fMRI work suggests that each cortical area

can perform multiple cognitive functions (Just & Varma 2007).

Second, we may have failed to find fMRI differences due to

range restriction from sampling only experts, though our wide

performance means and standard deviations for items would

argue against this. However, future work should consider

comparing experts to novices. Third, we did not hear the

participants’ utterances during the modified think-aloud in the

fMRI scanner. Nonetheless, the high quality of their think-

aloud utterances immediately after leaving the scanner reduces

the potential impact of this limitation. Fourth, the reasoning

associated with answering MCQs may be quite different from

reasoning applied to seeing actual patients. Future work could

consider having participants view videotapes of patient–

physician encounters. Fifth, there may be differences between

conventional thinking aloud and the modified method used in this study.

Our exploratory findings support emerging fMRI literature

(Jeste & Harris 2010; Riess 2010), and the fMRI data provide

evidence that think-aloud protocols can help educators with

assessing complex cognitive processes, such as diagnostic

reasoning. This is an encouraging finding, as think-aloud

protocols are far more feasible than conducting fMRI studies at

the current time. Ultimately, we believe the measurement

approach of combining neuroimaging and think-aloud data

with more traditional assessment measures offers great poten-

tial for advancing our understanding of diagnostic reasoning.

Declaration of interest: The authors report no conflicts of

interest. The authors alone are responsible for the content and

writing of the article.

This study was supported by grant funding from the

American Board of Internal Medicine Foundation.

The views expressed in this article are those of the authors

and do not reflect the views of the Department of Defense or

other federal agencies.

Notes on contributors

STEVEN J. DURNING, MD, PhD, is Professor of Medicine and Pathology

who directs an introduction to clinical reasoning course and co-directs the

Long Term Career Outcome Study at USU.

ANTHONY R. ARTINO, Jr., PhD, is Associate Professor of Medicine and

Preventive Medicine & Biometrics. He has a PhD in educational psychology

and co-directs the Long Term Career Outcome Study at USU.

THOMAS J. BECKMAN, MD is Professor of Medicine and Medical Education

at the Mayo Clinic.

JOHN GRANER, PhD, is an fMRI analyst at the Uniformed Services

University.

CEES VAN DER VLEUTEN, PhD, is Professor of Education at Maastricht

University.

ERIC HOLMBOE is the Senior Vice President and Chief Medical Officer for the

American Board of Internal Medicine.

LAMBERT SCHUWIRTH, MD, PhD, is Professor of Medical Education at

Flinders University.

Glossary of terms in medical education

Think-aloud protocol: A technique whereby participants

state (think aloud) whatever they are thinking, feeling, or

doing as they complete a task.

References: Ericsson, K., & Simon, H. (1993). Protocol

Analysis: Verbal Reports as Data (2nd ed.). Cambridge, MA: MIT

Press.

Ericsson KA, Charness N, Feltovich PJ, and Hoffman RR

(2006). The Cambridge Handbook of Expertise and Expert

Performance. New York: Cambridge University Press.

Functional magnetic resonance imaging or func-

tional MRI (fMRI): A procedure that measures brain

activity by detecting changes in blood flow due to energy

use by regions of neurons.

Reference: Huettel, S. A.; Song, A. W.; McCarthy, G. (2009),

Functional Magnetic Resonance Imaging (2nd ed.),

Massachusetts: Sinauer.

References

Anderson JR. 2010. Cognitive psychology and its implications. 7th ed.

New York: Worth Publishers.

Baddeley AD. 1986. Working memory. Oxford: Clarendon Press.

Baddeley A. 1992. Working Memory. Science 255:556–559.

Beauchamp MS, Haxby JV, Jennings JE, DeYoe EA. 1999. An fMRI version

of the Farnsworth-Munsell 100-hue test reveals multiple color-selective

areas in human ventral occipitotemporal cortex. Cerebral Cortex

9:257–263.

Brennan TA, Leape LL, Laird NM, et al. 1991. Incidence of adverse events

and negligence in hospitalized patients: Results of the Harvard Medical

Practice Study. N Engl J Med 324:370–376.

Cavanna AE, Trimble MR. 2006. The precuneus: A review of its functional

anatomy and behavioural correlates. Brain 129:564–583.

Cox RW. 1996. AFNI: Software for analysis and visualization of functional

magnetic resonance neuroimages. Comput Biomed Res 29:162–173.

Croskerry P. 2003. The importance of cognitive errors in diagnosis and

strategies to minimize them. Acad Med 78:775–780.

Desmond JE, Chen SH, DeRosa E, Pryor MR, Pfefferbaum A, Sullivan EV.

2003. Increased frontocerebellar activation in alcoholics during verbal

working memory: An fMRI study. Neuroimage 19:1510–1520.

Ericsson KA, Simon HA. 1984. Protocol analysis: Verbal reports as data.

Cambridge, MA: MIT Press.

Ericsson KA, Simon HA. 1987. Verbal reports on thinking. In: Faerch C,

Kasper G, editors. Introspection in second language research.

Clevedon, England: Multilingual Matters.

Ericsson KA, Charness N, Feltovich PJ, Hoffman RR. 2006. The Cambridge

Handbook of expertise and expert performance. New York: Cambridge

University Press.

Graybiel AM. 2005. The basal ganglia: Learning new tricks and loving it.

Curr Opin Neurobiol 15:638–644.

Higgs J, Jones MA. 2008. Clinical decision making and multiple problem

spaces. In: Higgs J, Jones MA, Loftus S, Christensen N, editors. Clinical

reasoning in the health professions. 3rd ed. Amsterdam: Elsevier

Butterworth Heinemann.

Jeste DV, Harris JC. 2010. Wisdom – A neuroscience perspective. JAMA

304(14):1602–1603.

Klasen M, Zvyagintsev M, Weber R, Mathiak KA, Mathiak K. 2008. Think

aloud during fMRI: Neuronal correlates of subjective experience in

video games. Lect Notes Comput Sc 5294:132–138.

Koepp MJ, Gunn RN, Lawrence AD, Cunningham VJ, Dagher A, Jones T,

Brooks DJ, Bench CJ, Grasby PM. 1998. Evidence for striatal dopamine

release during a video game. Nature 393:266–268.

Melnick DE. 2009. Licensing examinations in North America: Is external

audit valuable? Med Teach 31:212–214.

Messick S. 1989. Meaning and values in test validation: The science and

ethics of assessment. Educ Res 18:5–11.

Newell A, Simon HA. 1972. Human problem solving. Englewood Cliffs, NJ:

Prentice Hall.

Russo JE, Johnson EJ, Stephens DL. 1989. The validity of verbal protocols.

Mem Cognition 17:759–769.

Rosen HJ, Ojemann JG, Ollinger JM, Petersen SE. 2000. Comparison of

brain activation during word retrieval done silently and aloud using

fMRI. Brain and Cognition 42:201–217.

Riess H. 2010. Empathy in medicine – A neurobiological perspective. JAMA

304(14):1604–1605.

Stuss D, Knight RT. 2002. The frontal lobes. New York: Oxford University

Press.

Van der Vleuten CPM, Newble DI. 1995. How can we test clinical

reasoning? The Lancet 345:1032–1034.

Appendix 1: fMRI data acquisition and analysis methodology

Data acquisition

Acquisitions were performed using an echo-planar imaging

(EPI) sequence of 40 contiguous sagittal slices per brain

volume (TR = 2000 ms, TE = 25 ms, flip angle = 60°, slice

thickness = 4.0 mm). In-plane resolution was 3.75 × 3.75 mm

(64 × 64 voxels). An fMRI task presentation of the 32 questions

was created using E-Prime software (Psychology Software

Tools, Inc.) and displayed via a goggle system (Nordic

NeuroLab Inc., Milwaukee, WI) while each participant was

in the fMRI scanner.
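
As a quick consistency check on the EPI parameters above, the in-plane field of view and the slab coverage follow directly from the reported matrix size, voxel size, and slice count. The Python sketch below only performs that arithmetic; the function and variable names are illustrative and do not come from the study.

```python
def extent_mm(n_samples, spacing_mm):
    """Return the spatial extent covered by n_samples at the given spacing."""
    return n_samples * spacing_mm

# Values reported for the EPI sequence.
in_plane_fov_mm = extent_mm(64, 3.75)   # 64 voxels at 3.75 mm each
slab_coverage_mm = extent_mm(40, 4.0)   # 40 contiguous 4.0 mm slices

print(in_plane_fov_mm, slab_coverage_mm)  # 240.0 160.0
```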

The questions were presented in a random order over four

fMRI acquisition runs. Each run contained eight questions. The

exact length of each run varied, depending on the amount of

time each subject took to progress through the reading and

answering phases of each question. Mean run length (± standard

deviation) was 392 ± 62 s. Subjects underwent pre-training

to acquaint them with the method and layout of question

presentation and the correct use of the buttons corresponding

to answer options A to E. In addition to the functional imaging,

a high-resolution T1-weighted image was acquired for anatomical

reference (three-dimensional GRE; TR = 6.6 ms,

TE = 2.5 ms, flip angle = 12°). This anatomical T1

image consisted of 312 sagittal slices, a slice thickness of

0.6 mm and an in-plane resolution of 0.468 × 0.468 mm

(512 × 512 voxels).
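
The run structure described above (32 questions in random order, split over four runs of eight) can be illustrated with a short Python sketch. The study used E-Prime for presentation and does not report how the order was generated, so the shuffling scheme, the seed, and all names below are assumptions made for illustration. The last line converts the mean run length into an approximate number of brain volumes using the 2 s repetition time reported for the EPI sequence.

```python
import random

N_QUESTIONS = 32
N_RUNS = 4
TR_S = 2.0  # repetition time reported for the EPI sequence

def assign_runs(n_questions, n_runs, seed):
    """Shuffle question numbers and split them into equally sized runs."""
    if n_questions % n_runs != 0:
        raise ValueError("questions must divide evenly across runs")
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    order = list(range(1, n_questions + 1))
    rng.shuffle(order)
    per_run = n_questions // n_runs
    return [order[i * per_run:(i + 1) * per_run] for i in range(n_runs)]

runs = assign_runs(N_QUESTIONS, N_RUNS, seed=0)

# Mean run length of 392 s at one volume every 2 s.
mean_volumes_per_run = 392 / TR_S
```

Under these assumptions each run holds eight distinct questions, and a run of mean length corresponds to 196 volumes before any initial volumes are discarded.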

Data analysis

Image preprocessing included typical fMRI procedures:

removal of the first three brain volumes (6 s) from each 4D

time series, slice-time correction, motion correction, registra-

tion to the T1 anatomical image, smoothing with an 8 mm full-

width-half-max Gaussian kernel, and conversion of voxel

values to voxel-wise percent-change-from-mean rather than

absolute intensity.
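
Three of the preprocessing steps above reduce to simple arithmetic, sketched here in plain Python for a single voxel: discarding the first three volumes, rescaling to percent change from the mean, and converting the 8 mm full-width-half-max into the standard deviation of the Gaussian kernel. The intensity values and function names are hypothetical; slice-time correction, motion correction, registration, and the smoothing itself are not reproduced.

```python
import math

def drop_initial_volumes(series, n_drop=3):
    """Discard the first n_drop volumes (3 volumes = 6 s at TR = 2 s)."""
    return series[n_drop:]

def percent_change_from_mean(series):
    """Express each value as percent change from the series mean."""
    mean = sum(series) / len(series)
    return [100.0 * (value - mean) / mean for value in series]

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian full-width-half-max to its standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

# Hypothetical raw intensities for one voxel (arbitrary scanner units).
raw = [1300.0, 1150.0, 1050.0, 990.0, 1000.0, 1010.0, 1020.0, 980.0]
steady = drop_initial_volumes(raw)
scaled = percent_change_from_mean(steady)  # [-1.0, 0.0, 1.0, 2.0, -2.0]
sigma_mm = fwhm_to_sigma(8.0)              # roughly 3.4 mm
```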

The ‘answer’ times varied from question to question and

were modeled with a gamma-variate function with variable

duration and variable amplitude (amplitude variation was

based on duration variation). The ‘think-aloud’ time was

constant at 14 s and was modeled with a non-variable gamma-

variate. The GLM analysis was used to determine the signifi-

cance of these model time-courses within each voxel, using

the ‘reading’ phase as a ‘baseline.’ Time-courses associated

with preprocessing motion correction parameters were also

included in the GLM regression analysis to regress out any

further image intensity changes due to motion.
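
The modeling described above can be sketched for one synthetic voxel as follows. Each phase is represented as a boxcar of the event's duration convolved with a gamma-variate response, so the modeled amplitude grows with the time taken to answer, up to a plateau for long events; the think-aloud boxcar is fixed at 14 s. The shape parameters (b = 8.6, c = 0.547), the onsets, the durations, the simulated amplitudes, and the hand-written least-squares solver are all assumptions for illustration and are not taken from the study. Motion regressors, which the study included, would simply be further columns in the design.

```python
import math

TR = 2.0  # seconds, as reported for the EPI acquisition

def gamma_variate(t, b=8.6, c=0.547):
    """Gamma-variate response scaled to peak at 1 (b and c are assumed values)."""
    if t <= 0:
        return 0.0
    return (t / (b * c)) ** b * math.exp(b - t / c)

def block_regressor(onsets, durations, n_vols, dt=0.1):
    """Convolve boxcars with the gamma-variate and sample once per TR."""
    kernel = [gamma_variate(k * dt) for k in range(int(round(20.0 / dt)))]
    n_fine = int(round(n_vols * TR / dt))
    boxcar = [0.0] * n_fine
    for onset, duration in zip(onsets, durations):
        start = int(round(onset / dt))
        stop = min(n_fine, int(round((onset + duration) / dt)))
        for i in range(start, stop):
            boxcar[i] = 1.0
    step = int(round(TR / dt))
    regressor = []
    for vol in range(n_vols):
        i = vol * step
        total = sum(boxcar[i - k] * kernel[k]
                    for k in range(min(i + 1, len(kernel))))
        regressor.append(total * dt)
    return regressor

def ols(columns, y):
    """Least-squares betas via the normal equations (Gauss-Jordan elimination)."""
    p = len(columns)
    aug = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
           + [sum(u * v for u, v in zip(columns[i], y))] for i in range(p)]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(p):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [x - factor * z for x, z in zip(aug[r], aug[i])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

# Hypothetical timing: two questions within a 100-volume (200 s) run.
n_vols = 100
answer = block_regressor([20.0, 90.0], [12.0, 25.0], n_vols)   # variable duration
think = block_regressor([40.0, 125.0], [14.0, 14.0], n_vols)   # fixed 14 s
intercept = [1.0] * n_vols  # unmodeled time (reading) acts as the baseline

# Noise-free synthetic voxel with known amplitudes.
signal = [1.5 * av + 0.5 * tv + 0.2 for av, tv in zip(answer, think)]
betas = ols([answer, think, intercept], signal)
contrast = betas[0] - betas[1]  # answering minus thinking aloud
```

Because the synthetic signal contains no noise, the fit recovers the simulated amplitudes to within floating-point error; real data would also require an estimate of residual variance to test the contrast for significance.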
