
Potential Speech Features for Cognitive Load Measurement

M. Asif Khawaja

NICTA / EET UNSW, Sydney, Australia

+61 2 8374 5568


Natalie Ruiz

NICTA / CSE UNSW, Sydney, Australia

+61 2 8374 5570


Fang Chen

NICTA, ATP, Sydney, Australia

+61 2 8374 5555


ABSTRACT

Intelligent user interfaces with an awareness of a user’s

experienced level of cognitive load have the potential to change

the way output strategies are implemented and executed.

However, current methods of measuring cognitive load are

intrusive and unsuitable in real-time scenarios. Certain speech

features have been shown to change under high levels of load.

We present a dual-task speech based user study in which we

explore three speech features: pause length, pause frequency

and latency to response. These features are evaluated for their

diagnostic capacity. Pause length and latency to response are
shown to be useful indicators of high load versus low load

speech.

Categories and Subject Descriptors

D.2.2 [Design Tools and Techniques]: User interfaces; H.5.2

[User Interfaces]: Input Devices and Strategies, Interaction

Styles, Voice I/O.

General Terms

Measurement, Performance, Design, Experimentation

Keywords

Speech features, Cognitive load.

1. INTRODUCTION

In complex, data-intense situations, users can experience high

levels of cognitive load which can interfere with their ability to

complete a task at an optimum level of performance. Intelligent

user interfaces, which are aware of the increase in experienced

load of the user could, in fact, deploy output strategies that

alleviate this problem by modulating the pace and content of the

output or interaction. However, detecting changes in

experienced levels of cognitive load is not trivial. Certain

speech features have been shown to change under high levels of

load. In a speech-enabled interface, speech input is collected for

recognition purposes; it is therefore possible to analyse this

interactive voice data for the existence of feature patterns that

betray increases in cognitive load for that user.

In this paper, we focus on assessing speech features as potential
indices of cognitive load. To do this, it is necessary to first

identify and quantify the fluctuations of a number of speech

features as cognitive load increases. We present the design and

method of a user study which induces two controlled levels of

load while soliciting natural speech from each subject. We

analyse the results and summarise preliminary conclusions

related to our exploratory analysis of speech signal pauses and

task response latencies.

2. BACKGROUND

Cognitive load refers to the amount of mental demand imposed

by a particular task, and has been associated with the limited

capacity of working memory and novel information [4, 5]. It is

derived from the semantic or representational complexity of the

task, among other factors. However, the same task can affect

different users in different ways, and can induce levels of
perceived cognitive load that vary from user to user. This is due

to a number of reasons, for example, level of domain or

interface expertise of the user, their age, mental or physical

impediments, etc. High cognitive load has a major impact on

users’ ability to learn from the task, and can severely impact

their performance, detracting from learning [5].

Cognitive load has been assessed through physiological,

performance and self-report measures [5]; however, such

measures are intrusive or require a lot of equipment and

expertise. While they may be useful approaches in research

situations, they are often unsuitable for deployment in real-life

scenarios. Behavioural changes, such as a higher frequency of
disfluencies or prosodic changes in speech, are known to occur
as cognitive load increases, and measures of them are much more
amenable to these circumstances. Such measures can also be
implicit, as they are based on the analysis of data streams
produced by the user as they complete the task. These can be
standardised and allow for

comparison across users [11].

In particular, linguistic features can be extracted from spoken or

written language. They are highly unobtrusive as the data can

be collected while the subject completes the tasks without them

being aware of it happening. Analysis can be carried out on the

content of the language (throughput, coherency etc.) or the

manner in which it is delivered (pitch, volume, articulation rate

etc.). Prosodic content can also be analysed: the number and
length of pauses and hesitations, as well as peak intonation
patterns, may also indicate high load situations [8, 3].

Further, prior work has shown significant variations in levels of

spoken disfluency, articulation rate and filler and pause rates in

users experiencing low versus high cognitive load. In other

studies, speech variations have been shown to occur when

subjects find it difficult to communicate with the system via

speech, hence entering an “error-spiral” of misrecognition.

Such a high-load scenario causes subjects to hyper-articulate,

which in turn causes changes in the speech signal [6]. At the

more semantic end of linguistic indices, the frequency of

occurrence of sentence fragments has been one of the symptoms

observed. Sentence fragments consist of incomplete syntactic

structures or ill-formed sentences, also known as disfluencies,

which include self-correction and false starts [3]. At the

OzCHI 2007, 28-30 November 2007, Adelaide, Australia. Copyright the

author(s) and CHISIG. Additional copies are available at the ACM

Digital Library (http://portal.acm.org/dl.cfm) or can be ordered from

CHISIG([email protected])

OzCHI 2007 Proceedings, ISBN 978-1-59593-872-5




physical end of the spectrum, the signal characteristics of

speech such as energy, changes in pitch and fundamental

frequency have also been explored as possible candidates for

indicators of load. Direct comparisons between conditions of

low and high load have shown that speech rate, fundamental

speech frequency and speech energy could be used to

distinguish between different levels of experienced load [8].

3. EXPERIMENT DESIGN

3.1 Dual Task Paradigm

The dual-task paradigm has been widely used in the field of

psychology to induce high levels of cognitive load. The subject

is required to perform two tasks at the same time. This becomes

a much more difficult task than either one on its own. Dual-task

performance is expected to degrade in both tasks, compared to

when performed separately. This is largely due to the limited

capacity of working memory as well as the load required to shift

attention from one task to the other. This latter effect is known

as interference [7].

In high complexity, real-time scenarios, such situations are

likely to occur frequently, and users are often required to

manage two or more tasks at the same time. The dual-task

paradigm was chosen to help induce the high level of cognitive
load in our study.

3.2 Hypotheses

We expected speech pauses to be a likely indicator of load in a

variety of aspects, since they are the main elements of features

already shown in prior work to be symptomatic of high load, such
as sentence fragments (pause length) and articulation rate (pause
frequency). We expected a significant increase across users in

the number of speech pauses in the high load task when

compared to the low load (control) task. Pauses are indicative
of extra time taken for problem solving and, particularly in a

dual-task scenario, the time it takes to manage the limited

capacity of working memory as the subject works through the

tasks. We similarly expected the length of those pauses would

also increase, for all users. In addition, it was expected the more

difficult task would yield an increase in response latency.

3.3 Method

A reading and comprehension task was chosen as the control

task. Each task involved two subtasks:

(i) reading a text extract aloud,

(ii) answering some open-ended questions about the content of

that extract aloud.

Short extracts on general knowledge were prepared so that any
expertise effect was avoided. The subjects were to commit these
extracts to memory as best they could while reading aloud.

The difficulty level of the extract was rated using the Lexile

Framework for Reading [1], which provides a standard for

defining text difficulty and reading measurement by examining
semantic difficulty and syntactic complexity of the text [2]. The

ratings range from 200 to 1700 Lexile, reflecting the reading

level of a first grade student and a graduate student respectively.

The ratings for the prepared extracts are shown in Table 1.

The following comprehension questions were asked by the

experimenter at the completion of each reading:

• Give a short summary of the story in at least five full

sentences.

• What was the most interesting point in this story?

• Describe at least two other points highlighted in this story.

The experiment was conducted in two different sessions,

one for the dual-task and one for the control task. The dual-task

was aurally based and consisted of playing a series of random

two-digit numbers through a headset, softly in the background

at random intervals, while the subject was completing the

reading and comprehension task. The subjects were required to
count how many numbers they heard during both reading and

comprehension. A few seconds break was provided between the

reading and the comprehension subtasks.
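As an illustration only, a secondary-task stimulus of this kind could be scheduled along the following lines. This is a minimal sketch: the function name, the gap range of 3 to 8 seconds, the 60-second duration and the seed are all hypothetical, since the timing parameters used in the study are not reported here.

```python
import random

def make_number_schedule(duration_s, min_gap_s=3.0, max_gap_s=8.0, seed=None):
    """Return a list of (onset_seconds, two_digit_number) pairs.

    Assumption: gaps between onsets are drawn uniformly from
    [min_gap_s, max_gap_s]; the text only says "random intervals".
    """
    rng = random.Random(seed)
    schedule = []
    onset = rng.uniform(min_gap_s, max_gap_s)
    while onset < duration_s:
        # Random two-digit number, i.e. an integer from 10 to 99 inclusive.
        schedule.append((onset, rng.randint(10, 99)))
        onset += rng.uniform(min_gap_s, max_gap_s)
    return schedule

schedule = make_number_schedule(duration_s=60.0, seed=1)
# The correct answer to the counting question is the number of stimuli played.
expected_count = len(schedule)
```

Comparing each subject's reported count against `expected_count` would give a simple accuracy score for the secondary task.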

Table 1: Load levels of selected stories

Task Load Level    Lexile Rating    Dual Task
Low                1300L            No
High               1300L            Yes

3.4 Procedure

The subjects were asked to read the extracts aloud at their own

pace and their speech was recorded. They were then asked to

respond to the set of questions aloud and in full sentences,
which were also recorded. They did not have freedom of

inspection of the extracts to answer the questions, as the reading

material was taken away after it was read. In the dual-task

condition, the subjects were required to answer the counting

questions as well. At the end of each reading and

comprehension, the subjects were asked to rate the difficulty of

reading these stories and answering the comprehension

questions on two 9-point scales, to allow us to verify whether

the perceived levels of load increased as designed.

Besides the speech data, we also collected galvanic skin

response (GSR) data by attaching the GSR biosensors to the

same subjects during the experiments conducted for this study.

Skin conductance is reported to increase with mental stress and
cognitive load [9, 10].

3.5 Participants

The experiment was conducted in two sessions. Set 1 – Dual

Task (high load) involved 15 subjects (7 male and 8 female)

while there were 9 subjects (5 male and 4 female) in Set 2 –

Control Task (low load). No subjects were repeated in the sets

to avoid carry-over effects of knowing what the questions were

going to be. All subjects were randomly selected, remunerated, native

English speakers and were asked to complete reading

comprehension tasks. It was assumed that adults over 18 have

relatively similar reading and comprehension skills, hence

differences in reading ability would be negligible.

4. RESULTS

4.1 Subjective Ratings

Different subjects had slightly different responses toward the

task difficulty. A few found the experiment challenging and were

desperate to finish it as quickly as possible. Others were quite

calm and relaxed through the task. The general consensus was

that the high load (dual) task was the more difficult to handle,

anecdotally confirming the effectiveness of the method

employed to increase the experienced cognitive load.

The subjective ratings for the low load task showed a similar
pattern to those for the high load task, and no statistically
significant difference was found in the average reported ratings.




Nevertheless, the dual-task paradigm is widely used and has

been effective in the past; the reasons for a lack of reported

differences are discussed in a later section.

Figure 1: Subjective Ratings for Low/High Load
(Bar chart "Average Difficulty Ratings": difficulty level, on an axis from 5.5 to 7.5, for the Low Load Task (no dual-task) and the High Load Task (dual-task); series: Reading Avg., Answering Avg., Task Averages.)

4.2 Lengths of Pauses

Since the comprehension sub-task provided data most like

natural speech, the analysis of pauses was carried out on this

data. Analysis of pauses was conducted for two different types:

silent pauses (speechless segments) and filled pauses (e.g.

“aah”, “hmm”, “umm”). To the best of our knowledge, there is
no standard available to define the minimum duration of what
constitutes a significant or deliberate pause. We defined the

minimum length pause to be 0.3s. Pauses shorter than 0.3s were

assumed to be inherent in natural speech.
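The 0.3 s threshold can be illustrated with a short sketch. It assumes the recording has already been segmented into speech intervals given as (start, end) times in seconds, for example by a voice activity detector or by manual annotation; that input format, the function name and the sample values are hypothetical and are not taken from the study.

```python
MIN_PAUSE_S = 0.3  # minimum pause length, as defined in the text

def silent_pauses(speech_segments, min_pause_s=MIN_PAUSE_S):
    """Return (start, end) gaps between consecutive speech segments
    that last at least min_pause_s seconds."""
    ordered = sorted(speech_segments)
    pauses = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        # Shorter gaps are treated as inherent in natural speech.
        if gap >= min_pause_s:
            pauses.append((prev_end, next_start))
    return pauses

# Made-up speech intervals in seconds, for illustration only.
segments = [(0.0, 1.2), (1.3, 2.5), (3.1, 4.0), (5.5, 6.0)]
pauses = silent_pauses(segments)
lengths = [end - start for start, end in pauses]
average_length = sum(lengths) / len(lengths)
```

Here the 0.1 s gap is discarded and the 0.6 s and 1.5 s gaps are kept. Filled pauses would need a separate annotation step, since they are not silent.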

To test for differences in pause lengths between the low and high
load sets, an independent-sample 2-tailed t-test was conducted
at the 0.05 level of significance for average silent pause lengths.
Average silent pause lengths for the high load set were
significantly higher than those for the low load set (a difference
of 39.1%, p = 0.01 < 0.05). The filled pause lengths and total
pause lengths showed similarly significant differences, with
p = 0.04 and p = 0.009, respectively. This suggests a clear trend of
increased pause length, for both silent and filled pauses, between
the low and high load sets, as hypothesized and illustrated in
Figure 2.
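For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute an independent-sample, two-tailed t-test using only the Python standard library. It assumes the equal-variance (pooled) form of the test, which the text does not specify, and the sample values are made up for illustration; they are not the study's data.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, integrating the density from 0 to |t| with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_area)

def independent_t_test(a, b):
    """Pooled-variance t-test (assumption: equal variances in both groups)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, two_tailed_p(t, df)

# Hypothetical per-subject average silent pause lengths, in seconds.
low = [0.61, 0.72, 0.58, 0.80, 0.66]
high = [0.95, 1.10, 0.88, 1.02, 0.91]
t_stat, df, p_value = independent_t_test(low, high)
significant = p_value < 0.05
```

If unequal variances were suspected, which is plausible with groups of 15 and 9 subjects, Welch's variant of the test would be the more cautious choice.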

Figure 2: Average Pause Lengths
(Bar chart "Average Pause Lengths": seconds, on an axis from 0.0 to 1.4, for the Low Load Task (no dual-task) and the High Load Task (dual-task); series: Silent Pauses, Filled Pauses, Total Pauses.)

4.3 Frequency of Pauses

The number of pauses was expected to increase, as the subjects

took longer to answer comprehension questions for the more

difficult dual task. The analysis of the number of pauses was

carried out and included both silent pauses and filled pauses.

However, since each subject spoke for a different length of time

when answering the comprehension questions, it was necessary

to normalize the total frequency count by the total time

spoken by each subject. This data was used for our statistical

analysis. Figure 3 shows the increasing trend of pause

frequencies from the low load task to the high load task. To test

the significance of the difference in number of pauses between

low and high load sets, an independent-sample 2-tailed t-test
was conducted at the 0.05 level of significance. Unexpectedly, no
significant difference was found between the low load task and
the high load task with respect to the number of pauses. Perhaps
a more severe change in cognitive load is necessary to induce a
significant difference; however, the observed trend is in the
expected direction.
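The normalization step can be sketched as follows. The 30-second window is inferred from the title of Figure 3 ("Average Pause Frequencies Per 30 Second"); the function name and the per-subject values are hypothetical.

```python
def pauses_per_window(pause_count, speaking_time_s, window_s=30.0):
    """Scale a raw pause count to a rate per window_s seconds of speech."""
    if speaking_time_s <= 0:
        raise ValueError("speaking_time_s must be positive")
    return pause_count * window_s / speaking_time_s

# Hypothetical subjects: (number of pauses, total speaking time in seconds).
subjects = {"S1": (18, 90.0), "S2": (25, 150.0)}
rates = {name: pauses_per_window(n, t) for name, (n, t) in subjects.items()}
```

Although the second subject produced more pauses in total, the first has the higher rate once speaking time is taken into account, which is why the raw counts cannot be compared directly.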

Figure 3: Normalized Pause Frequencies
(Bar chart "Average Pause Frequencies Per 30 Seconds": number of pauses, on an axis from 0.0 to 14.0, for the Low Load Task (no dual-task) and the High Load Task (dual-task); series: Silent Pauses, Filled Pauses, Total Pauses.)

4.4 Response Latency

Response latency is the time taken by the subjects to answer the

question, from the point they finish reading the question to their

first response for that question. It was expected that the

response latency would increase with increases in load level.

Figure 4 shows the increasing trend of the data as expected.
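The latency definition above translates directly into a small calculation. The sketch below assumes that each question has two annotated timestamps, the moment the question ends and the onset of the first response; the function name and the timestamps are hypothetical. Whether a filled pause such as "umm" counts as the first response is not specified in the text and would need to be fixed by the annotation scheme.

```python
from statistics import mean

def response_latency(question_end_s, first_response_s):
    """Seconds between the end of the question and the first response."""
    latency = first_response_s - question_end_s
    if latency < 0:
        raise ValueError("response starts before the question ends")
    return latency

# Hypothetical annotations: (question_end, first_response) in seconds.
annotations = [(12.4, 13.9), (40.0, 42.5), (75.2, 77.2)]
latencies = [response_latency(q, r) for q, r in annotations]
average_latency = mean(latencies)
```

Averaging per subject first, and then across subjects in each set, would keep subjects who answered more questions from dominating the group mean.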

Figure 4: Average Response Latency Times
(Bar chart "Average Response Latency": seconds, on an axis from 0.00 to 3.50, for the Low Load Task (no dual-task) and the High Load Task (dual-task).)

To test the significance of the difference in average response
latency times between the low and high load sets, we conducted
an independent-sample 2-tailed t-test at the 0.05 level of
significance. Average response latency times in the low load
task were significantly lower than those for the high load task
(a difference of 52%, p = 0.01 < 0.05). This suggests that
higher levels of cognitive load result in increased response
latency times, concurring with our hypothesis.

5. DISCUSSION

Analysis of the comprehension speech data was performed for

selected features including pause lengths, pause frequency, and

response latency times. Average pause lengths increased, as did

pause frequencies and response latency times, the first and last

significantly. These differences are most plausibly due to the
presence of the dual task, which induced higher cognitive load in
the high load condition. We remain optimistic about the potential
of pause frequencies: although they did not yield statistically
significant results, the increasing trend is present and concurs
with our hypothesis. It is possible that higher levels of load may
induce more severe symptoms of this kind. Additionally, it is

expected that not all users will be affected by the same

symptoms – some users may not exhibit any symptoms at all for

a particular feature. The lack of significance may be due to

individual differences being lost in measuring the averages

across users.

Analysis of the subjective ratings did not show significant

differences between the two sets of subjective ratings; this may

have been due to a lack of a point of reference for experienced

load across groups, since the experiment was a between-

subjects design. A within-subjects design would have allowed a

reference point for the two tasks.

Also, though the results apply strongly across a wide variety of

users, they are specific for this combination of tasks, in a dual-

task scenario. Different results may be achieved with a different

set of tasks, e.g. a verbal comprehension task combined with a

visuo-spatially oriented secondary task involving maps. In our

case, the results are based on the natural-speech comprehension

data and our secondary task was purely auditory. Using

Baddeley’s multi-component model of working memory, both could be

classified as tasks that would inhabit the Phonological Loop of

the working memory model, which deals with verbal and

linguistic tasks. This ensured a high level of interference
between tasks, and this, combined with the degree of novelty in
the stimulus, produced a high level of cognitive load. In the case of a

verbal/spatial dual-task combination, the level of induced load

may not have been as severe.

6. CONCLUSION AND FUTURE WORK

The richness of multimodal interactive data means that systems

could eventually use these features of behaviour to detect

cognitive load variations, in an implicit manner. The

identification of such indices could help build intelligent

interface systems that adapt to difficulties experienced by the

user in real-time, e.g. by regulating the pace, volume or format

of the output. Adapting the interface to each user in this way
could help maintain optimal user performance.

This user study has provided encouraging evidence for the use

of three different speech-based feature indicators of increased

load, namely pause length and frequencies in natural speech, as

well as latency to response. Though these features require

further validation, analysis and evaluation, they offer a

promising contribution to the set of potential implicit

interactive indices.

It seems a logical step to analyze ‘meta-interaction’ features

such as these, since the data is already collected for recognition

and interpretation purposes in intelligent user interface systems.

Automatic measurement of cognitive load via speech features is

possible with the use of machine learning algorithms that

characterize such changes in the speech signal and can be

integrated with more transparent indices such as the ones

presented here.

In terms of future work, we have not yet analyzed the GSR data
collected, but we expect it to show statistically significant
differences between the low load and high load tasks as well.

In conclusion, the speech features we have analyzed offer some

potential for future deployment. We expect such individual and

composite modal features to form part of a greater multimodal

suite of index features acting in concert as robust indices of

cognitive load.

7. ACKNOWLEDGMENTS

Our thanks go to the subjects who participated in the

experiments.

8. REFERENCES

[1] The Lexile Framework for Reading; MetaMetrics Inc.;

http://www.Lexile.com; Last accessed: July 2007.

[2] Lennon C. and Burdick H.; The Lexile Framework as an

Approach for Reading Measurement and Success; A white

paper from The Lexile Framework for Reading, April

2004; http://www.Lexile.com; Last accessed: July 2007.

[3] A. Berthold, & A. Jameson, "Interpreting Symptoms of

Cognitive Load in Speech Input" In J. Kay (Ed.), UM99,
User modeling: Proceedings of the Seventh International

Conference. Vienna: Springer Wien New York, pp. 235–

244, 1999.

[4] A. Baddeley, "Working Memory". Science, 1992. 255:

556-559.

[5] F. Paas, et al., "Cognitive load measurement as a means to

advance cognitive load theory". Educational Psychologist,

2003, 38, 63-71.

[6] Oviatt, S. L., MacEachern, M. & Levow, G. Predicting

hyperarticulate speech during human-computer error

resolution, Speech Communication, 1998, vol. 24, 2, 1-23

[7] Kahneman, D. 1973. Attention and effort. Prentice-Hall,

New Jersey.

[8] Kettebekov, S. Exploiting Prosodic Structuring of

Coverbal Gesticulation, ICMI04, October 13-15, 2004,

State College, PA, USA

[9] ProComp Infinity Hardware Manual; Thought Technology

Ltd., http://www.thoughttechnology.com; page 30.

[10] Jacobs S. C., Friedman R., et al.; Use of Skin Conductance

Changes during Mental Stress Testing as an index of

Autonomic Arousal in Cardiovascular Research, The

American Heart Journal, 1994, Vol. 128 (1), No 6,

pp.1170-1177

[11] Hart, S.G. & Staveland, L.E. (1988). Development of

NASA-TLX (Task Load Index): results of empirical and

theoretical research. In P.A. Hancock & N. Meshkati

(Eds.), Human Mental Workload (pp 139-183).

Amsterdam: North-Holland
