7/17/2019 Potential Speech Features for Cognitive Load Measurement
http://slidepdf.com/reader/full/potential-speech-features-for-cognitive-load-measurement 1/4
Potential Speech Features for Cognitive Load Measurement
M. Asif Khawaja
NICTA / EET UNSW
Sydney, Australia
+61 2 8374 5568
Natalie Ruiz
NICTA / CSE UNSW
Sydney, Australia
+61 2 8374 5570
Fang Chen
NICTA
ATP, Sydney, Australia
+61 2 8374 5555
ABSTRACT
Intelligent user interfaces with an awareness of a user’s
experienced level of cognitive load have the potential to change
the way output strategies are implemented and executed.
However, current methods of measuring cognitive load are
intrusive and unsuitable in real-time scenarios. Certain speech
features have been shown to change under high levels of load.
We present a dual-task speech based user study in which we
explore three speech features: pause length, pause frequency
and latency to response. These features are evaluated for their
diagnostic capacity. Pause length and latency to response are
shown to be useful indicators of high load versus low load
speech.
Categories and Subject Descriptors
D.2.2 [Design Tools and Techniques]: User interfaces; H.5.2
[User Interfaces]: Input Devices and Strategies, Interaction
Styles, Voice I/O.
General Terms
Measurement, Performance, Design, Experimentation
Keywords
Speech features, Cognitive load.
1. INTRODUCTION
In complex, data-intense situations, users can experience high
levels of cognitive load which can interfere with their ability to
complete a task at an optimum level of performance. Intelligent
user interfaces that are aware of an increase in the user’s
experienced load could deploy output strategies that
alleviate this problem by modulating the pace and content of the
output or interaction. However, detecting changes in
experienced levels of cognitive load is not trivial. Certain
speech features have been shown to change under high levels of
load. In a speech-enabled interface, speech input is already collected for
recognition purposes; it is therefore possible to analyse this
interactive voice data for the existence of feature patterns that
betray increases in cognitive load for that user.
In this paper, we focus on assessing speech features as potential
indices of cognitive load. To do this, it is necessary to first
identify and quantify the fluctuations of a number of speech
features as cognitive load increases. We present the design and
method of a user study which induces two controlled levels of
load while soliciting natural speech from each subject. We
analyse the results and summarise preliminary conclusions
related to our exploratory analysis of speech signal pauses and
task response latencies.
2. BACKGROUND
Cognitive load refers to the amount of mental demand imposed
by a particular task, and has been associated with the limited
capacity of working memory and novel information [4, 5]. It is
derived from the semantic or representational complexity of the
task, among other factors. However, the same task can affect
different users in different ways, and can induce levels of
perceived cognitive load that vary from user to user. This is due
to a number of reasons, for example, level of domain or
interface expertise of the user, their age, mental or physical
impediments, etc. High cognitive load has a major impact on
users’ ability to learn from the task, and can severely impact
their performance, detracting from learning [5].
Cognitive load has been assessed through physiological,
performance and self-report measures [5]; however, such
measures are intrusive or require a lot of equipment and
expertise. While they may be useful approaches in research
situations, they are often unsuitable for deployment in real-life
scenarios. Behavioural changes, such as an increased frequency of
disfluencies or prosodic changes in speech, are known to occur
as cognitive load increases, and measuring them is much more
amenable to these circumstances. Such measures can also be implicit, as they
are based on the analysis of data streams employed by the user as
they complete the task. These can be standardised and allow for
comparison across users [11].
In particular, linguistic features can be extracted from spoken or
written language. They are highly unobtrusive as the data can
be collected while the subject completes the tasks without them
being aware of it happening. Analysis can be carried out on the
content of the language (throughput, coherency etc.) or the
manner in which it is delivered (pitch, volume, articulation rate
etc.). Prosodic content can also be analysed: the number and
length of pauses and hesitations, as well as peak intonation
patterns, may also be an indication of high load situations [8, 3].
Further, prior work has shown significant variations in levels of
spoken disfluency, articulation rate and filler and pause rates in
users experiencing low versus high cognitive load. In other
studies, speech variations have been shown to occur when
subjects find it difficult to communicate with the system via
speech, hence entering an “error-spiral” of misrecognition.
Such a high-load scenario causes subjects to hyper-articulate,
which in turn causes changes in the speech signal [6]. At the
more semantic end of linguistic indices, the frequency of
occurrence of sentence fragments has been one of the symptoms
observed. Sentence fragments consist of incomplete syntactic
structures or ill formed sentences, also known as disfluencies,
which include self-correction and false starts [3]. At the
physical end of the spectrum, the signal characteristics of
speech such as energy, changes in pitch and fundamental
frequency have also been explored as possible candidates for
indicators of load. Direct comparisons between conditions of
low and high load have shown that speech rate, fundamental
speech frequency and speech energy could be used to
distinguish between different levels of experienced load [8].
OzCHI 2007, 28-30 November 2007, Adelaide, Australia. Copyright the
author(s) and CHISIG. Additional copies are available at the ACM
Digital Library (http://portal.acm.org/dl.cfm) or can be ordered from
CHISIG([email protected])
OzCHI 2007 Proceedings, ISBN 978-1-59593-872-5
3. EXPERIMENT DESIGN
3.1 Dual Task Paradigm
The dual-task paradigm has been widely used in the field of
psychology to induce high levels of cognitive load. The subject
is required to perform two tasks at the same time. This becomes
a much more difficult task than either one on its own. Performance
is expected to degrade in both tasks, compared to
when they are performed separately. This is largely due to the limited
capacity of working memory as well as the load required to shift
attention from one task to the other. This latter effect is known
as interference [7].
In high complexity, real-time scenarios, such situations are
likely to occur frequently, and users are often required to
manage two or more tasks at the same time. The dual-task
paradigm was chosen to help induce the high level of cognitive
load in our study.
3.2 Hypotheses
We expected speech pauses to be a likely indicator of load in a
variety of aspects, since they are the main elements of features
already shown in the literature to be symptomatic of high load, such as
sentence fragments (pause length) and articulation rate (pause
frequency). We expected a significant increase across users in
the number of speech pauses in the high load task when
compared to the low load (control) task. Pauses are indicative
of extra time taken for problem solving and, particularly in a
dual-task scenario, of the time it takes to manage the limited
capacity of working memory as the subject works through the
tasks. We similarly expected that the length of those pauses would
also increase for all users. In addition, it was expected that the more
difficult task would yield an increase in response latency.
3.3 Method
A reading and comprehension task was chosen as the control
task. Each task involved two subtasks:
(i) reading a text extract aloud,
(ii) answering some open-ended questions about the content of
that extract aloud.
Short extracts on general knowledge were prepared so that
any expertise effect was avoided. The subjects were to
commit the extracts to memory as best as they could while reading aloud.
The difficulty level of each extract was rated using the Lexile
Framework for Reading [1], which provides a standard for
defining text difficulty and reading measurement by examining
the semantic difficulty and syntactic complexity of the text [2]. The
ratings range from 200 to 1700 Lexile, reflecting the reading
level of a first grade student and a graduate student respectively.
The ratings for the prepared extracts are shown in Table 1.
The following comprehension questions were asked by the
experimenter at the completion of each reading:
• Give a short summary of the story in at least five full
sentences.
• What was the most interesting point in this story?
• Describe at least two other points highlighted in this story.
The experiment was conducted in two different sessions,
one for the dual-task and one for the control task. The dual-task
was aurally based and consisted of playing a series of random
two-digit numbers through a headset, softly in the background
at random intervals, while the subject was completing the
reading and comprehension task. The subjects were required to
count how many numbers they heard during both reading and
comprehension. A break of a few seconds was provided between the
reading and the comprehension subtasks.
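
The secondary counting task described above can be sketched as a stimulus schedule. This is a minimal illustration, not the software used in the study: the function name make_number_schedule, the gap range of 3 to 8 seconds and the 60 second duration are all assumptions, since the text only states that two-digit numbers were played at random intervals.

```python
import random

def make_number_schedule(duration_s, min_gap_s=3.0, max_gap_s=8.0, seed=None):
    """Return a list of (onset_seconds, two_digit_number) pairs.

    The gap range is a hypothetical choice; the study only says that
    numbers were played at random intervals.
    """
    rng = random.Random(seed)
    schedule = []
    t = rng.uniform(min_gap_s, max_gap_s)
    while t < duration_s:
        # Any integer from 10 to 99 counts as a two-digit number.
        schedule.append((round(t, 2), rng.randint(10, 99)))
        t += rng.uniform(min_gap_s, max_gap_s)
    return schedule

schedule = make_number_schedule(duration_s=60.0, seed=1)
# The number of scheduled items is the count the subject should report.
correct_count = len(schedule)
```

Comparing a subject's reported count with correct_count would give a simple accuracy score for the secondary task.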
Table 1: Load levels of selected stories
Task Load Level    Lexile Rating    Dual Task
Low                1300L            No
High               1300L            Yes
3.4 Procedure
The subjects were asked to read the extracts aloud at their own
pace and their speech was recorded. They were then asked to
respond to the set of questions aloud and in full sentences,
which were also recorded. They could not consult
the extracts to answer the questions, as the reading
material was taken away after it was read. In the dual-task
condition, the subjects were required to answer the counting
questions as well. At the end of each reading and
comprehension, the subjects were asked to rate the difficulty of
reading these stories and answering the comprehension
questions on two 9-point scales, to allow us to verify whether
the perceived levels of load increased as designed.
Besides the speech data, we also collected galvanic skin
response (GSR) data by attaching the GSR biosensors to the
same subjects during the experiments conducted for this study.
Skin conductance is known to be directly proportional to the
memory stress and cognitive load [9, 10].
3.5 Participants
The experiment was conducted in two sessions. Set 1 – Dual
Task (high load) involved 15 subjects (7 male and 8 female)
while there were 9 subjects (5 male and 4 female) in Set 2 –
Control Task (low load). No subjects were repeated in the sets
to avoid carry-over effects of knowing what the questions were
going to be. All subjects were random, remunerated, native
English speakers and were asked to complete reading
comprehension tasks. It was assumed that adults over 18 have
relatively similar reading and comprehension skills, hence
differences in reading ability would be negligible.
4. RESULTS
4.1 Subjective Ratings
Different subjects had slightly different responses toward the
task difficulty. A few found the experiment challenging and were
desperate to finish it as quickly as possible. Others were quite
calm and relaxed through the task. The general consensus was
that the high load (dual) task was the more difficult to handle,
anecdotally confirming the effectiveness of the method
employed to increase the experienced cognitive load.
The subjective ratings for the low load task were similar to
those for the high load task, and no statistically significant
difference was found in the average reported ratings.
Nevertheless, the dual-task paradigm is widely used and has
been effective in the past; the reasons for a lack of reported
differences are discussed in a later section.
[Chart: Average Difficulty Ratings. Y-axis: Difficulty Level. Conditions: Low Load Task (No dual-task), High Load Task (Dual-task). Series: Reading Avg., Answering Avg., Task Averages.]
Figure 1: Subjective Ratings for Low/High Load
4.2 Lengths of Pauses
Since the comprehension sub-task provided data most like
natural speech, the analysis of pauses was carried out on this
data. Analysis of pauses was conducted for two different types:
silent pauses (speechless segments) and filled pauses (e.g.
Aah.., hmm.., umm.. etc). To the best of our knowledge, there is
no standard available to define the minimum duration of what
constitutes a significant or deliberate pause. We defined the
minimum pause length to be 0.3s. Pauses shorter than 0.3s were
assumed to be inherent in natural speech.
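
The 0.3s threshold can be illustrated with a short sketch of silent pause extraction. It assumes the speech has already been segmented into (start, end) intervals in seconds, for example by a voice activity detector; that input format, the function names and the sample timestamps are illustrative assumptions and not the tooling used in the study. Filled pauses would need annotated segments and are not covered here.

```python
MIN_PAUSE_S = 0.3  # minimum pause duration defined in the text

def silent_pauses(speech_segments, min_pause_s=MIN_PAUSE_S):
    """Return the durations of gaps between consecutive speech segments
    that last at least min_pause_s seconds.

    speech_segments is a time-ordered list of (start, end) tuples.
    """
    pauses = []
    for (_, prev_end), (next_start, _) in zip(speech_segments, speech_segments[1:]):
        gap = next_start - prev_end
        if gap >= min_pause_s:
            pauses.append(gap)
    return pauses

def average(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

# Made-up segment boundaries for one answer (seconds).
segments = [(0.0, 1.2), (1.3, 2.5), (3.0, 4.0), (5.1, 6.0)]
pauses = silent_pauses(segments)        # the 0.1 s gap is discarded
average_pause_length = average(pauses)  # roughly 0.8 s for this data
```

Averaging per subject first, and only then across subjects, keeps long answers from dominating the group mean.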
To test for differences in pause lengths between the low and high
load sets, an independent-sample 2-tailed t-test was conducted
at the 0.05 level of significance for average silent pause lengths.
Average silent pause lengths for the high load set were
significantly longer than those for the low load set (difference of
39.1%), with p=0.01 (<0.05). The filled pause lengths and total
pause lengths showed similarly significant differences, with
p=0.04 and p=0.009, respectively. This suggests a clear trend of
increased pause length, for both silent and filled pauses, between
the Low and High Load sets, as hypothesized and illustrated in
Figure 2.
[Chart: Average Pause Lengths. Y-axis: Seconds. Conditions: Low Load Task (No dual-task), High Load Task (Dual-task). Series: Silent Pauses, Filled Pauses, Total Pauses.]
Figure 2: Average Pause Lengths
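
The group comparison reported for pause lengths can be sketched as follows. The sample values are made up, and two points are assumptions rather than details given in the text: that the t-test used a pooled variance (Student's form, not Welch's), and that the percentage difference is expressed relative to the low load mean.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's independent two-sample t statistic and its degrees of
    freedom, assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, df

def percent_difference(high, low):
    """Difference of the group means as a percentage of the low mean."""
    return 100.0 * (mean(high) - mean(low)) / mean(low)

# Hypothetical per-subject average silent pause lengths (seconds).
low_load = [0.55, 0.60, 0.52, 0.65, 0.58]
high_load = [0.80, 0.75, 0.90, 0.85, 0.78, 0.82]

t, df = pooled_t_statistic(high_load, low_load)
diff_pct = percent_difference(high_load, low_load)
```

The two-tailed p-value follows from the t distribution with df degrees of freedom; a statistics package such as SciPy (scipy.stats.ttest_ind) returns it directly.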
4.3 Frequency of Pauses
The number of pauses was expected to increase, as the subjects
took longer to answer comprehension questions for the more
difficult dual task. The analysis of the number of pauses
included both silent pauses and filled pauses.
However, since each subject spoke for a different length of time
when answering the comprehension questions, it was necessary
to normalize the total frequency count by the total time
spoken by each subject. This normalized data was used for our statistical
analysis. Figure 3 shows the increasing trend of pause
frequencies from the low load task to the high load task. To test
the significance of the difference in number of pauses between
low and high load sets, an independent-sample 2-tailed t-test
was conducted at the 0.05 level of significance. Unexpectedly, no
significant difference was found between the low load task and
the high load task with respect to the number of pauses. Perhaps
a more severe change in cognitive load is necessary to induce a
significantly different result; however, the trend is positive.
[Chart: Average Pause Frequencies Per 30 Seconds. Y-axis: Number of Pauses. Conditions: Low Load Task (No dual-task), High Load Task (Dual-task). Series: Silent Pauses, Filled Pauses, Total Pauses.]
Figure 3: Normalized Pause Frequencies
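
The normalization of pause counts by speaking time can be written as a one-line rate. The 30 second window follows the chart title; the function name and the sample counts are illustrative assumptions.

```python
def pauses_per_window(pause_count, speaking_time_s, window_s=30.0):
    """Scale a raw pause count to a rate per window_s seconds of speech."""
    if speaking_time_s <= 0:
        raise ValueError("speaking_time_s must be positive")
    return pause_count * window_s / speaking_time_s

# Two hypothetical subjects with the same raw count but different
# speaking times end up with different normalized frequencies.
rate_long_answer = pauses_per_window(18, 90.0)   # 6.0 pauses per 30 s
rate_short_answer = pauses_per_window(18, 45.0)  # 12.0 pauses per 30 s
```

Without this step, a subject who simply talks for longer would appear to pause more often.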
4.4 Response Latency
Response latency is the time taken by the subjects to answer the
question, from the point they finish reading the question to their
first response for that question. It was expected that the
response latency would increase with increases in load level.
Figure 4 shows the increasing trend of the data as expected.
[Chart: Average Response Latency. Y-axis: Seconds. Conditions: Low Load Task (No dual-task), High Load Task (Dual-task).]
Figure 4: Average Response Latency Times
To test the significance of the difference in average response
latency times between the low and high load sets, we conducted
an independent-sample 2-tailed t-test at the 0.05 level of
significance. Average response latency times in the low load
task were significantly lower than those for the high load task
(difference of 52%), with p=0.01 (<0.05). This suggests that
higher levels of cognitive load result in increased response
latency times, concurring with our hypothesis.
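
Response latency as defined in this section can be computed from two timestamps per question. The sketch below assumes that the end of each question and the onset of the subject's first response have been marked in the recording; the timestamps and the function name are made up for illustration.

```python
def response_latency(question_end_s, first_response_s):
    """Time from the end of a question to the start of the first response."""
    latency = first_response_s - question_end_s
    if latency < 0:
        raise ValueError("response cannot start before the question ends")
    return latency

# Hypothetical (question end, first response onset) pairs in seconds,
# one pair per comprehension question for a single subject.
events = [(12.0, 13.5), (40.0, 42.5), (75.0, 77.0)]
latencies = [response_latency(q, r) for q, r in events]
average_latency = sum(latencies) / len(latencies)  # 2.0 s for this data
```

The per-subject averages would then feed the same group comparison used for pause lengths.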
5. DISCUSSION
Analysis of the comprehension speech data was performed for
selected features including pause lengths, pause frequency, and
response latency times. Average pause lengths increased, as did
pause frequencies and response latency times, the first and last
significantly. We attribute these differences to the
dual task inducing higher cognitive load in the high load
task. We remain optimistic about the potential of pause
frequencies: though they did not yield statistically significant
results, the increasing trend still exists and concurs with our
hypothesis. It is possible that higher levels of load may
induce more severe symptoms of this kind. Additionally, it is
expected that not all users will be affected by the same
symptoms – some users may not exhibit any symptoms at all for
a particular feature. The lack of significance may be due to
individual differences being lost in measuring the averages
across users.
The subjective ratings did not differ significantly
between the two sets; this may
have been due to a lack of a point of reference for experienced
load across groups, since the experiment was a between-subjects
design. A within-subjects design would have allowed a
reference point for the two tasks.
Also, though the results apply across a wide variety of
users, they are specific to this combination of tasks in a dual-task
scenario. Different results may be achieved with a different
set of tasks, e.g. a verbal comprehension task combined with a
visuo-spatially oriented secondary task involving maps. In our
case, the results are based on the natural-speech comprehension
data and our secondary task was purely auditory. Using
Baddeley’s multi-component model of working memory, both could be
classified as tasks that would occupy the phonological loop,
which deals with verbal and
linguistic tasks. This ensured a high level of interference
between tasks and, combined with the degree of novelty in
the stimulus, a high level of cognitive load. In the case of a
verbal/spatial dual-task combination, the level of induced load
may not have been as severe.
6. CONCLUSION AND FUTURE WORK
The richness of multimodal interactive data means that systems
could eventually use these features of behaviour to detect
cognitive load variations in an implicit manner. The
identification of such indices could help build intelligent
interface systems that adapt to difficulties experienced by the
user in real-time, e.g. by regulating the pace, volume or format
of the output. Adapting the interface to each user in this way could
help maintain optimal user performance.
This user study has provided encouraging evidence for the use
of three different speech-based feature indicators of increased
load, namely pause length and frequencies in natural speech, as
well as latency to response. Though these features require
further validation, analysis and evaluation, they offer a
promising contribution to the set of potential implicit
interactive indices.
It seems a logical step to analyze ‘meta-interaction’ features
such as these, since the data is already collected for recognition
and interpretation purposes in intelligent user interface systems.
Automatic measurement of cognitive load via speech features is
possible with the use of machine learning algorithms that
characterize such changes in the speech signal and can be
integrated with more transparent indices such as the ones
presented here.
In terms of future work, we have not yet analyzed the GSR data
collected, but we expect that it will also show statistically
significant variations between the low load and high load tasks.
In conclusion, the speech features we have analyzed offer some
potential for future deployment. We expect such individual and
composite modal features to form part of a greater multimodal
suite of index features acting in concert as robust indices of
cognitive load.
7. ACKNOWLEDGMENTS
Our thanks go to the subjects who participated in the
experiments.
8. REFERENCES
[1] The Lexile Framework for Reading; MetaMetrics Inc.;
http://www.Lexile.com; Last accessed: July 2007.
[2] Lennon C. and Burdick H.; The Lexile Framework as an
Approach for Reading Measurement and Success; A white
paper from The Lexile Framework for Reading, April
2004; http://www.Lexile.com; Last accessed: July 2007.
[3] A. Berthold, & A. Jameson, "Interpreting Symptoms of
Cognitive Load in Speech Input" In J. Kay (Ed.), UM99,
User modeling: Proceedings of the Seventh International
Conference. Vienna: Springer Wien New York, pp. 235–
244, 1999.
[4] A. Baddeley, "Working Memory". Science, 1992. 255:
556-559.
[5] F. Paas, et al., "Cognitive load measurement as a means to
advance cognitive load theory". Educational Psychologist,
2003, 38, 63-71.
[6] Oviatt, S. L., MacEachern, M. & Levow, G. Predicting
hyperarticulate speech during human-computer error
resolution, Speech Communication, 1998, vol. 24, 2, 1-23.
[7] Kahneman, D. 1973. Attention and effort. Prentice-Hall,
New Jersey.
[8] Kettebekov, S. Exploiting Prosodic Structuring of
Coverbal Gesticulation, ICMI04, October 13-15, 2004,
State College, PA, USA.
[9] ProComp Infinity Hardware Manual; Thought Technology
Ltd., http://www.thoughttechnology.com; page 30.
[10] Jacobs S. C., Friedman R., et al.; Use of Skin Conductance
Changes during Mental Stress Testing as an index of
Autonomic Arousal in Cardiovascular Research, The
American Heart Journal, 1994, Vol. 128 (1), No 6,
pp.1170-1177.
[11] Hart, S.G. & Staveland, L.E. (1988). Development of
NASA-TLX (Task Load Index): results of empirical and
theoretical research. In P.A. Hancock & N. Meshkati
(Eds.), Human Mental Workload (pp 139-183).
Amsterdam: North-Holland.