
Speech Communication xxx (2005) xxx–xxx

www.elsevier.com/locate/specom

Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors

Diane J. Litman a,*, Kate Forbes-Riley b

a University of Pittsburgh, Department of Computer Science and Learning Research and Development Center, Pittsburgh, PA 15260, USA
b University of Pittsburgh, Learning Research and Development Center, Pittsburgh, PA 15260, USA

Received 27 July 2004; received in revised form 13 September 2005; accepted 21 September 2005

* Corresponding author. E-mail addresses: [email protected] (D.J. Litman), [email protected] (K. Forbes-Riley).

0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2005.09.008

Abstract

While human tutors respond to both what a student says and to how the student says it, most tutorial dialogue systems cannot detect the student emotions and attitudes underlying an utterance. We present an empirical study investigating the feasibility of recognizing student state in two corpora of spoken tutoring dialogues, one with a human tutor, and one with a computer tutor. We first annotate student turns for negative, neutral and positive student states in both corpora. We then automatically extract acoustic–prosodic features from the student speech, and lexical items from the transcribed or recognized speech. We compare the results of machine learning experiments using these features alone, in combination, and with student and task dependent features, to predict student states. We also compare our results across human–human and human–computer spoken tutoring dialogues. Our results show significant improvements in prediction accuracy over relevant baselines, and provide a first step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Emotional speech; Predicting user state via machine learning; Prosody; Empirical study relevant to adaptive spoken dialogue systems; Tutorial dialogue systems


1. Introduction

This paper investigates the automatic recognition of student emotions and attitudes in both human–human and human–computer spoken tutoring dialogues, on the basis of acoustic–prosodic and lexical information extractable from utterances. In recent years, the development of computational tutorial dialogue systems has become more and more prevalent (Aleven and Rose, 2003; Rose and Freedman, 2000; Rose and Aleven, 2002), as one method of attempting to close the current performance gap between human and computer tutors; recent experiments with such systems (e.g., Graesser et al., 2001b) are starting to yield promising empirical results.


Motivated by connections between learning and student emotional state (Coles, 1999; Izard, 1984; Masters et al., 1979; Nasby and Yando, 1982; Potts et al., 1986; Seipp, 1991), another proposed method for closing the performance gap with human tutors has been to incorporate affective reasoning into computer tutoring systems, independently of whether or not the tutor is dialogue-based (Conati et al., 2003a; Kort et al., 2001; Bhatt et al., 2004). Recently, some preliminary results with computer tutors have been presented to support this line of research. Aist et al. (2002) have shown that adding human-provided emotional scaffolding to an automated reading tutor increases student persistence, while Craig and Graesser (2003) have found a significant relationship between students' confusion and learning during interactions with a mixed initiative dialogue tutoring system.1 Our long-term goal is to merge these lines of dialogue and affective tutoring research, by enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student emotions and attitudes, and to investigate whether this improves learning and other measures of performance. The development of the adaptation component requires accurate emotion recognition; this paper presents results regarding this first step of our larger agenda: building an emotion recognition component.

1 We have also found a correlation between the ratio of negative/neutral student states and learning gains in our intelligent tutoring spoken dialogue data (to be described below), although these results are very preliminary.

Currently, most intelligent tutoring dialogue systems do not attempt to recognize student emotions and attitudes, and furthermore are text-based (Aleven et al., 2001; Evens et al., 2001; VanLehn et al., 2002; Zinn et al., 2002), which may limit their success at emotion prediction. Speech supplies a rich source of information about a speaker's emotional state, and research in the area of emotional speech has already shown that acoustic and prosodic features can be extracted from the speech signal and used to develop predictive models of emotions (Cowie et al., 2001; ten Bosch, 2003; Pantic and Rothkrantz, 2003; Scherer, 2003). Much of this research has used databases of speech read by actors or native speakers as training data (often with semantically neutral content) (Oudeyer, 2002; Polzin and Waibel, 1998; Liscombe et al., 2003). Although analyses of the acoustic–prosodic features associated with acted archetypal emotions support some correlations between specific features and emotions (e.g., lower average pitch and speaking rate for "sad" speech (ten Bosch, 2003)), these results generally transfer poorly to real applications (Cowie and Cornelius, 2003; Batliner et al., 2003). As a result, recent work motivated by spoken dialogue applications has started to use naturally occurring speech to train emotion predictors (Shafran et al., 2003; Batliner et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Litman et al., 2001; Batliner et al., 2000; Lee et al., 2001; Devillers et al., 2003). However, within emotion research using naturally occurring data, both the range of emotions presented and the features that correlate with them have varied depending on the application domain (cf. Shafran et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Batliner et al., 2000; Devillers et al., 2003). Thus, more empirical work is needed to explore whether and how similar techniques can be used effectively to model student states in spoken dialogue tutoring systems. In addition, past research using naturally occurring speech has studied only human–human (Devillers et al., 2003), human–computer (Shafran et al., 2003; Lee et al., 2001; Lee et al., 2002; Narayanan, 2002; Ang et al., 2002), or wizard-of-oz (Batliner et al., 2000; Batliner et al., 2003; Narayanan, 2002) dialogue data. Just as previous work has demonstrated that results based on acted or read speech transfer poorly to spontaneous speech, more empirical work is needed to explore whether and how results regarding emotion prediction transfer across different types of naturally occurring spoken dialogue data, i.e. spoken dialogues between humans versus spoken dialogues between humans and computers, and/or spoken dialogues from different application domains.

In this paper, we examine the relative utility of the acoustic–prosodic and lexical information in student utterances, both with and without student and task dependent information, for recognizing student emotions and attitudes in spoken tutoring dialogues; we also examine the impact of using human transcriptions versus noisier system output for obtaining such information. Our methodology builds on and generalizes the results of prior research from the area of spoken dialogue, while applying them to the new domain of naturally occurring tutoring dialogues (in the domain of qualitative physics). Our work is also novel in replicating our analyses across two comparable spoken dialogue corpora: one with a computer tutor, and the other with a human tutor performing the same task as our computer system. Although these corpora were collected under comparable experimental conditions, they differ with respect to many characteristics, such as utterance length and student initiative. Given the current limitations of both speech and natural language processing technologies, computer tutors are far less flexible than human tutors, and also make more errors. The use of human tutors thus represents an "ideal" computer system, and thereby provides a benchmark for estimating the performance of our emotion recognition methods, at least with respect to speech and natural language processing performance.

In our experiments, we first annotate student turns in both of our spoken dialogue tutoring corpora for negative, neutral, and positive emotions and attitudes. We then create two datasets for each corpus: an "Agreed" dataset containing only those student turns whose annotations were originally agreed on by the annotators, and a "Consensus" dataset containing all annotated student turns, where original disagreements were given a consensus label. These datasets are summarized in Table 4. We then automatically extract acoustic–prosodic features from the speech signal of our annotated student turns, and lexical items from the transcribed or recognized speech, and perform a variety of machine learning experiments to predict our emotion categorizations using different feature set combinations. Overall, our results show that by using acoustic–prosodic features alone, or in combination with "identifier" features identifying specific subjects and tutoring problems, or in combination with lexical information, we can significantly improve over baseline (majority class) performance figures for emotion prediction. Our highest prediction accuracies are obtained by combining multiple feature types and by predicting only those annotated student turns that both annotators agreed on. Table 15 summarizes these results in our human–human corpus, and Table 19 summarizes these results in our human–computer corpus. However, simpler models containing only a subset of features (or feature types) work comparably in many experiments, and these simpler models often have the advantage in terms of ease of implementation and/or domain-independence. While many of our observations generalized across the human–human and human–computer dialogues, we also find interesting differences between recognizing emotion in our two corpora, and also as compared to prior studies in other domains. In general, lexical features yielded higher predictive utility than acoustic–prosodic features. Within acoustic–prosodic features, there was a trend for temporal features to have the highest predictive utility, followed by energy features and lastly, pitch features. However, the usefulness of acoustic–prosodic features varied across experiments and corpora; indeed, across prior research as a whole, the usefulness of particular acoustic–prosodic features appears to be often domain-dependent. Similarly, "identifier" features, whose use is limited to domains such as ours where there is a limited problem set and students reuse the tutoring system repeatedly, were found to have higher predictive utility in our human–computer corpus as compared to our human–human corpus. In sum, our recognition results provide an empirical basis for the next phase of our research, which will be to enhance our spoken dialogue tutoring system to automatically recognize and ultimately to adapt to student states.

Section 2 describes ITSPOKE, our intelligent tutoring spoken dialogue system and the corpus it produces, as well as a human–human spoken tutoring corpus that corresponds to the human–computer corpus produced by ITSPOKE. Section 3 describes our annotation scheme for manually labeling student emotions and attitudes, and evaluates inter-annotator agreement when this scheme is used to annotate student states in dialogues from both our human–human and human–computer corpora. Section 4 discusses how acoustic and prosodic features available in real-time to ITSPOKE are computed from our dialogues. Section 5 then presents our machine learning experiments in automatic emotion recognition, analyzing the predictive performance of acoustic–prosodic features alone or in combination, both with and without subject and task-dependent information. Section 6 investigates the impact of both adding a lexical feature representing the transcription of the student turn, and for the human–computer dialogues, using the noisy output of the speech recognizer rather than the actual transcription. Finally, Section 7 discusses related research, while Section 8 summarizes our results and describes our current and future directions.


2. Spoken dialogue tutoring corpora

2.1. Common aspects of the corpora

Our data for this paper come from spoken interactions between student and tutor, through which students learn to solve qualitative physics problems, i.e. thought-provoking "explain" or "why" type physics problems that can be answered without doing any mathematics. We have collected two corpora of these spoken tutoring dialogues, which are distinguished according to whether the tutor is a human or a computer.

In these spoken tutoring corpora, dialogue interaction between student and tutor is mediated via a web interface, supplemented with a high-quality audio link. An example screenshot of this web interface, generated during an interaction between a student and the computer tutor, is shown in Fig. 1. The qualitative physics problem (problem 58) is shown in the upper right box. The student begins by typing an essay answer to this problem in the middle right box. When finished with the essay, the student clicks the "SUBMIT" button.2 The tutor then analyzes the essay and engages the student in a spoken natural language dialogue to provide feedback and correct misconceptions in the essay, and to elicit more complete explanations. The middle left box in Fig. 1 is used during human–computer tutoring to record the dialogue history. This box remains empty during human–human tutoring, because both the student and tutor utterances would require manual transcription before they could be displayed. In the human–computer tutoring, in contrast, the speech recognition and speech synthesis components of the computer tutor can be used to provide the "transcriptions." After the dialogue between tutor and student is completed, the student revises the essay, thereby ending the tutoring for that physics problem or causing another round of tutoring/essay revision.

Fig. 1. Screenshot during human–computer spoken tutoring dialogue.

The experimental procedure for collecting both our spoken tutoring corpora is as follows3: (1) students are given a pre-test measuring their knowledge of physics, (2) students are asked to read through a small document of background material,4 (3) students use the web and voice interface to work through a set of up to 10 training physics problems with the (human or computer) tutor, and (4) students are given a post-test that is similar to the pre-test. The experiment typically takes no more than 7 h per student, and is performed in 1–2 sessions. Students are University of Pittsburgh students who have never taken a college level physics course, and who are native speakers of American English.

2 The "Tell Tutor" box is used for typed student login and logout.

3 Our spoken tutoring corpora were collected as part of a wider evaluation comparing student learning across speech-based and text-based human–human and human–computer tutoring conditions (Litman et al., 2004).

4 In the computer tutoring experiment, the pre-test was moved to after the background reading, to allow us to measure learning gains caused by the experimental manipulation without confusing them with gains caused by the background reading.

2.2. The human–human spoken dialogue tutoring corpus

Our human–human spoken dialogue tutoring corpus contains 128 transcribed dialogues (physics problems) from 14 different students, collected from Fall 2002–Fall 2003. One human tutor participated. The student and the human tutor were separated by a partition, and spoke to each other through head-mounted microphones. Each participant's speech was digitally recorded on a separate channel. Transcription and turn-segmentation of the student and tutor speech were then done by a paid transcriber. The transcriber added a turn boundary when: (1) the speaker stopped speaking and the other party in the dialogue began to speak, (2) the speaker asked a question and stopped speaking to wait for an answer, (3) the other party in the dialogue interrupted the speaker and the speaker paused to allow the other party to speak.

An emotion-annotated (Section 3) excerpt from our human–human tutoring corpus is shown in Fig. 2. In the human–human corpus, interruptions and overlapping speech are common; turns ending in "-" (as in TUTOR6, Fig. 2) indicate when speech overlaps with the following turn, and other punctuation has been added to the transcriptions for readability.

Fig. 2. Annotated excerpt from human–human spoken tutoring corpus.

2.3. The human–computer spoken dialogue tutoring corpus

Our human–computer spoken dialogue tutoring corpus contains 100 dialogues (physics problems) from 20 students, collected from Fall 2003–Spring 2004. Our "computer tutor" is called ITSPOKE (Intelligent Tutoring SPOKEn dialogue system) (Litman and Silliman, 2004). ITSPOKE uses as its "back-end" the text-based Why2-Atlas dialogue tutoring system (VanLehn et al., 2002), which handles syntactic and semantic analysis (Rose, 2000), discourse and domain level processing (Jordan and VanLehn, 2002; Jordan et al., 2003), and finite-state dialogue management (Rose et al., 2001).

To analyze the typed student essay, the Why2-Atlas back-end first parses the student essay into propositional representations, in order to find useful dialogue topics. It uses three different approaches (symbolic, statistical and hybrid) competitively to create a representation for each sentence, then resolves temporal and nominal anaphora and constructs proofs using abductive reasoning (Jordan et al., 2004).

During the subsequent dialogue, student speech is digitally recorded from head-mounted microphone input. Barge-ins and overlaps are not currently permitted.5 The student speech is sent to the Sphinx2 speech recognizer (Huang et al., 1993), whose stochastic language models have a vocabulary of 1240 words and are trained with 7720 student utterances from evaluations of Why2-Atlas and from pilot studies of ITSPOKE. Transcription (speech recognition) and turn-segmentation are done automatically in ITSPOKE. However, because speech recognition is imperfect, the human–computer data is also manually transcribed, for comparison. Sphinx2's most probable "transcription" (recognition output) is sent to the Why2-Atlas back-end for natural language understanding.

5 Although not yet evaluated, our next version of ITSPOKE supports barge-in, and thus allows the student to interrupt ITSPOKE when it is speaking, e.g., when it is giving a long explanation.

The dialogue is managed by a finite-state dialogue manager, where nodes correspond to tutor turns, and arcs to student turns. Why2-Atlas' natural language understanding (NLU) component associates a semantic grammar with each tutor question (i.e., with each node in the dialogue finite-state machine); grammars across questions may share rules. The categories in the grammar correspond to the expected responses for the question (i.e., to the arcs exiting the question node in the finite-state machine), and represent both correct answers and typical student misconceptions (VanLehn et al., 2002). Given a student's utterance, the output of the NLU component is thus a subset of the semantic concepts that were expected as answers to the tutor's prior question, and that were found when parsing the student's utterance. For instance, the semantic concept downward is used in many of the semantic grammars, and would be the semantic output for a variety of utterances such as "downwards", "towards earth", "is it downwards", "down", etc.
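To make the mapping from recognized utterances to expected semantic concepts concrete, the following is a minimal illustrative sketch of a single finite-state question node with a toy semantic grammar. The patterns, concept names and node names are hypothetical examples for illustration only; they are not the actual Why2-Atlas grammars or dialogue states.

```python
# Illustrative sketch only: one toy dialogue node whose outgoing arcs are chosen
# by matching the recognized student answer against a small "semantic grammar".
# Patterns, concepts, and node names are hypothetical, not Why2-Atlas internals.
import re

SEMANTIC_GRAMMAR = {
    "downward": [r"\bdown(ward(s)?)?\b", r"\btoward(s)? (the )?earth\b"],
    "zero_force": [r"\bno (net )?force\b", r"\bzero force\b"],
}

def concepts_for(utterance: str) -> set:
    """Return the subset of expected concepts found in the (recognized) utterance."""
    found = set()
    for concept, patterns in SEMANTIC_GRAMMAR.items():
        if any(re.search(p, utterance.lower()) for p in patterns):
            found.add(concept)
    return found

# Arcs out of the current question node, keyed by the set of matched concepts.
NEXT_NODE = {
    frozenset({"downward"}): "confirm_direction",
    frozenset(): "rephrase_question",   # nothing understood: re-ask the question
}

if __name__ == "__main__":
    for utt in ["is it downwards", "towards earth", "i don't know"]:
        c = concepts_for(utt)
        print(utt, "->", sorted(c), "->", NEXT_NODE.get(frozenset(c), "default_followup"))
```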

The text response produced by Why2-Atlas (i.e., the next node in the finite-state machine) is then sent to the Cepstral text-to-speech system6 and played to the student through the headphone. After each system prompt or student utterance is spoken, the system prompt, or the system's understanding of the student's response (i.e., the output of the speech recognizer), respectively, are added to the dialogue history. At the time the screenshot in Fig. 1 was generated, for example, the student had just said "free fall" (in this case the utterance was correctly recognized).

An emotion-annotated (Section 3) dialogue excerpt from our human–computer corpus is shown in Fig. 3. The excerpt shows both what the student said and what ITSPOKE recognized (the ASR annotations). As shown, the output of the automatic speech recognizer sometimes differed from what the student actually said. When ITSPOKE was not confident of what it thought the student said, it generated a rejection prompt and asked the student to repeat. On average, ITSPOKE produced 1.4 rejection prompts per dialogue. ITSPOKE also misrecognized utterances; when ITSPOKE heard something different than what the student said (as with the last student turn) but was confident in its hypothesis, it proceeded as if it heard correctly. While the ITSPOKE word error rate in this corpus was 31.2%, natural language understanding based on speech recognition (i.e., the recognition of semantic concepts instead of actual words) is the same as based on perfect transcription 92.4% of the time.7 The accuracy of recognizing semantic concepts is more relevant for dialogue evaluation, as it does not penalize for word errors that are unimportant to overall utterance interpretation.

Fig. 3. Annotated excerpt from human–computer spoken tutoring corpus.

6 The Cepstral system is a commercial outgrowth of the Festival system (Black and Taylor, 1997).

7 An internal evaluation of this semantic analysis component in an early version of the Why2-Atlas system (with its typed input, and thus "perfect" transcription) yielded 97% accuracy (Rose, 2005).
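The two evaluation views contrasted above (word error rate versus semantic-concept accuracy) can be illustrated with a short sketch. This is not the authors' evaluation code; the example utterances and the concept comparison are simplified placeholders.

```python
# Illustrative sketch: word error rate on recognizer output versus whether the
# recognized semantic concepts match those from the manual transcription.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def concepts_match(ref_concepts: set, hyp_concepts: set) -> bool:
    """True if recognition errors did not change the extracted semantic concepts."""
    return ref_concepts == hyp_concepts

if __name__ == "__main__":
    print(word_error_rate("the free fall motion", "a free fall emotion"))  # 0.5
    print(concepts_match({"freefall"}, {"freefall"}))                       # True
```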


3. An annotation scheme for student emotion and attitude

3.1. Emotion classes

In our data, student "emotions"8 can only be identified indirectly: via what is said and/or how it is said. However, such evidence is not always obvious, unambiguous, or consistent. For example, a student may express anger through the use of swear words, or through a particular tone of voice, or via a combination of signals, or not at all. Moreover, another student may present some of these same signals even when s/he does not feel anger.

In (Litman and Forbes-Riley, 2004a), we present a coding scheme for manually annotating the student turns9 in our spoken tutoring dialogues for intuitively perceived expression of emotion. In this scheme, expressions of emotion10 are viewed along a linear scale, shown and defined as follows11:

negative ← neutral → positive

Negative12: a student turn that expresses emotions such as confused, bored, irritated, uncertain, sad. Examples of negative student turns in our human–human and human–computer corpora are found in Figs. 2 and 3. Evidence13 of a negative emotion can come from the lexical expressions of uncertainty, e.g., the phrase "I don't know", a syntactic question, disfluencies, as well as acoustic and prosodic features, including pausing, pitch and energy variation. For example, the negative student turn, student5, in Fig. 2, contains the phrase "I don't know why", as well as frequent internal pausing and a wide pitch variation.14 The negative student turn, student19, in Fig. 3, displays a slow tempo and rising intonation.

Positive: a student turn that expresses emotions such as confident, enthusiastic. For example, student7 in Fig. 2 is labeled positive. Evidence of a positive emotion in this case comes from lexical expressions of certainty, e.g., "It's the . . .", as well as acoustic and prosodic features, including loud speech and a fast tempo. The positive student turn, student21, in Fig. 3, displays a fast tempo with very little pausing preceding the utterance.

Neutral: a student turn that does not express a positive or negative emotion. Examples of neutral student turns are student8 in Fig. 2 and student22 in Fig. 3. Acoustic and prosodic features, including moderate loudness, tempo, and inflection, give evidence for these neutral labels, as does the lack of semantic content in the grounding phrase, "mm-hm".

8 In the rest of this paper, we will use the term "emotion" loosely, to cover both affects and attitudes that can impact student learning. Although some argue that "emotion" should be distinguished from "attitude", some speech researchers have found that the narrow sense of "emotion" is too restrictive because it excludes states in speech where emotion is present but not full-blown, including arousal and attitude (Cowie and Cornelius, 2003). Some tutoring researchers have also found it useful to take a combined view of affect and attitude (Bhatt et al., 2004).

9 We use the terms "turn" and "utterance" interchangeably in this paper.

10 Although an expression of emotion is not interchangeable with the emotion itself (Russell et al., 2003), our use of the term "emotion" hereafter should be understood as referring (when appropriate) to the annotated expression of emotion.

11 In (Litman and Forbes-Riley, 2004a), we have also explored separately annotating strong, weak and mixed emotions, as well as annotating specific emotions such as uncertain, irritated, confident; complete details of our annotation studies are described therein.

12 These "negative", "neutral" and "positive" emotion classes correspond to traditional notions of "valence" (cf. Cowie and Cornelius, 2003), but these terms are not related to the impact of emotion on learning. For example, in work that draws on a "disequilibrium" theory of the relationship between emotion and learning, working through negative emotions is believed to be a necessary part of the learning process (Craig and Graesser, 2003).

Emotion annotations were performed from both audio and transcription using the sound visualization and manipulation tool, Wavesurfer.15 The emotion annotators were instructed to try to annotate emotion relative to both context and task. By context-relative we mean that a student turn in our tutoring dialogues is identified as expressing emotion relative to the other student turns in that dialogue. By task-relative we mean that a student turn perceived during tutoring as expressing an emotion might not be perceived as expressing the same emotion with the same strength in another (e.g., non-tutoring) situation. Moreover, the range of emotions that arise during tutoring might not be the same as the range of emotions that arise during some other task. For example, consider the context of a tutoring session, where a student has been answering tutor questions with apparent ease. If the tutor then asks another question, and the student responds slowly, saying "Um, now I'm confused", this turn would likely be labeled negative. However, in the context of a heated argument between two people, this same turn might be labeled as a weak negative, or even weak positive. Litman and Forbes-Riley (2004a) provides full details of our annotation scheme, including discussion of our coding manual and annotation tool, while Section 7 compares our scheme to related work.

13 As determined by post-annotation discussion (see Section 7).

14 As illustrated by the hyperlinks in Figs. 2 and 3, annotators could also listen to the recording of the dialogue, as detailed below. If your pdf reader does not support hyperlinks, you can listen to these dialogue excerpts at this website: http://www.cs.pitt.edu/itspoke/pub/.

15 (http://www.speech.kth.se/wavesurfer/). The tool is shown in Figs. 5 and 6.

3.2. Quantifying inter-annotator agreement

We conducted a study for each corpus, to quantify the degree of agreement between two coders (the authors) in classifying utterances using our annotation scheme. To analyze agreement in our human–human spoken tutoring corpus (Section 2.2), we randomly selected 10 transcribed dialogues from 9 subjects, yielding a dataset of 453 student turns, where approximately 40 turns came from each of the 9 subjects. The 453 turns were separately annotated by the two authors, using the emotion annotation scheme described above. To analyze agreement in our human–computer corpus (Section 2.3), we randomly selected 15 transcribed dialogues from 10 subjects, yielding a dataset of 333 student turns, where approximately 30 turns came from each of 10 subjects. Each turn was again separately annotated by the two authors.

Two confusion matrices summarizing the resulting agreement between the two emotion annotators for each corpus are shown in Tables 1 and 2. The rows correspond to the labels assigned by annotator 1, and the columns correspond to the labels assigned by annotator 2. For example, in Table 1, 112 negatives were agreed upon by both annotators, while 9 of the negatives assigned by annotator 1 were labeled as neutral by annotator 2, and 9 of the negatives assigned by annotator 1 were labeled as positive by annotator 2. Note that across both corpora, annotator 2 consistently annotates more positive and fewer neutral turns than annotator 1.

Table 1
Confusion matrix for human–human corpus annotation

              Negative    Neutral    Positive
Negative      112         9          9
Neutral       31          181        53
Positive      1           10         47

Table 2
Confusion matrix for human–computer corpus annotation

              Negative    Neutral    Positive
Negative      89          30         6
Neutral       32          94         38
Positive      6           19         19

As shown along the diagonal in Table 1, the two annotators agreed on the annotations of 340/453 student turns on the human–human tutoring data, achieving 75.1% agreement (Kappa = 0.6, α = 0.6).16 As shown along the diagonal in Table 2, the two annotators agreed on the annotations of 202/333 student turns in the human–computer tutoring data, achieving 60.7% agreement (Kappa = 0.4, α = 0.4).17 It has generally been found to be difficult to achieve levels of inter-annotator agreement above "Moderate" (see footnote 16) for emotion annotation in naturally occurring dialogues. Ang et al. (2002), for example, report inter-annotator agreement of 71% (Kappa 0.47), while Shafran et al. (2003) report Kappas ranging between 0.32 and 0.42. Such studies were nevertheless able to use acoustic–prosodic cues to effectively distinguish these annotator judgments of emotion.

16 Kappa and α are metrics for computing the pairwise agreement among annotators making category judgments. Kappa (Carletta, 1996; Siegel et al., 1988; Cohen, 1960) is computed as Kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of actual agreement among annotators, and P(E) is the proportion of agreement expected by chance. α (Krippendorf, 1980) is computed as α = 1 − D(O)/D(E), where D(O) is the proportion of observed disagreement between annotators and D(E) is the proportion of disagreement expected by chance. When there is no agreement other than that expected by chance, Kappa and α = 0. When there is total agreement, Kappa and α = 1. Krippendorf's (1980) α and Siegel et al.'s (1988) version of Kappa are nearly identical; however, these two metrics use a different method of estimating the probability distribution for chance than does Cohen's (1960) version of Kappa (DiEugenio and Glass, 2004), which is used in this paper. Although interpreting the strength of inter-annotator agreement is controversial (DiEugenio and Glass, 2004), Landis and Koch (1977) and others use the following standard for Kappa: 0.21–0.40, "Fair"; 0.41–0.60, "Moderate"; 0.61–0.80, "Substantial"; 0.81–1.00, "Almost Perfect". Krippendorf (1980) uses the following stricter standard for α: α < .67, "cannot draw conclusions"; .67 < α < .8, "allows tentative conclusions"; α > .8, "allows definite conclusions". Although neither metric is ideal for this study because they assume independent events, unlike other measures of agreement such as percent agreement, Kappa and α take into account the inherent complexity of a task by correcting for chance expected agreement.

17 Since our emotion categories are ordinal/interval rather than nominal, we can also quantify agreement using a weighted version of Kappa (Cohen, 1968), which accounts for the relative distances between successive categories. With (quadratic) weighting, our Kappa values increase to .7 and .5 for the human–human and human–computer annotations, respectively. Similarly, using an interval version of α (Krippendorf, 1980) that also accounts for a relative distance (of 1) between categories, α values increase to .7 and .5 for the human–human and human–computer annotations, respectively.
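The agreement figures above can be checked directly from the Table 1 counts. The short sketch below recomputes percent agreement and Cohen's (1960) unweighted Kappa from that confusion matrix; it is an illustration, not the authors' analysis script.

```python
# Recompute percent agreement and Cohen's unweighted Kappa from the Table 1
# confusion matrix (rows: annotator 1, columns: annotator 2).
table1 = [
    [112, 9, 9],     # annotator 1: negative
    [31, 181, 53],   # annotator 1: neutral
    [1, 10, 47],     # annotator 1: positive
]

n = sum(sum(row) for row in table1)                                    # 453 turns
p_a = sum(table1[i][i] for i in range(3)) / n                          # observed agreement
row_totals = [sum(row) for row in table1]
col_totals = [sum(table1[i][j] for i in range(3)) for j in range(3)]
p_e = sum(row_totals[k] * col_totals[k] for k in range(3)) / (n * n)   # chance agreement
kappa = (p_a - p_e) / (1 - p_e)

print(f"agreement = {p_a:.3f}, kappa = {kappa:.2f}")   # agreement = 0.751, kappa = 0.60
```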

A number of researchers have compensated for low inter-annotator agreement for emotion annotation by exploring ways of achieving consensus between disagreed annotations. Following Ang et al. (2002) and Devillers et al. (2003), we explored consensus labeling, both with the goal of increasing our usable dataset for prediction, and to include the more difficult annotation cases. For our consensus labeling, the original annotators revisited each originally disagreed case and, through discussion, sought a consensus label. After consensus labeling, agreement in both our human–human and human–computer data rose to 100%.18 A summary of the distribution of emotion labels after consensus labeling is shown in Table 3.

Table 3
Consensus labeling over emotion-annotated data

                  Negative    Neutral    Positive
Human–human       119         273        61
Human–computer    125         161        47

As in (Ang et al., 2002), we will experiment with predicting emotions in Section 5 using both our agreed data and our consensus-labeled data.19 Table 4 summarizes the characteristics of the emotion-annotated subsets of both our human and computer tutoring corpora, with respect to both the agreed and consensus emotion labels.

Table 4
Summary of emotion-annotated data

                            Human–human             Human–computer
                            Agreed     Consensus    Agreed     Consensus
# students                  9          9            10         10
# dialogues                 10         10           15         15
# student turns             340        453          202        333
# student words             1892       2879         478        833
# unique student words      379        475          127        180
Minutes student speech      13.48      19.38        11.28      18.92
Majority class (neutral)    53%        60%          47%        48%

18 There were eight student turns in the human–human corpus for which the annotators had difficulty deciding upon a consensus label; these cases were given the "neutral" consensus label as a result.

19 Although not discussed in this paper, we have also run prediction experiments using each individual annotator's labeled data; the results in each case were lower than those for the agreed data, and were approximately the same as the results for the consensus-labeled data, as discussed below.

As a final note, during the annotation and subsequent consensus discussions, we observed that the human–human and human–computer dialogues differ with respect to a variety of characteristics. Many of these differences are illustrated in the corpus excerpts above, and in part reflect the fact that our computer tutor is far less robust than our human tutor with respect to its interactiveness and understanding capabilities. Such differences can potentially impact both the emotional state of the student, and how the student is able to express an emotional state. We hypothesize that such differences may have also impacted the comparative difficulty in annotating emotion in the two corpora. For example, the average student turn length in the 10 annotated human–human dialogues is 6.61 words, while for the 15 human–computer dialogues the average turn length is 2.52 words. The fact that students speak less in the human–computer dialogues means that there is less information to make use of when judging expressed emotions. We also observed that in the human–human dialogues, there are more student initiatives and groundings as well as references to prior problems. The limitations of the computer tutor may have thus restricted how students expressed themselves (including how they expressed their own emotional states) in other ways besides word quantity. Finally, the fact that the computer tutor made processing errors may have impacted both the types and quantity of student emotional states. As shown in Table 3, there is a higher proportion of negative emotions in the human–computer corpus as compared to the human–human corpus (38% versus 26%, respectively). As we will see with our machine learning experiments in Section 5, emotion prediction is also more difficult in the human–computer corpus, which may again in part reflect its differing dialogue characteristics that arise from the limitations of the computer tutor.

4. Extracting features from the speech signal of student turns

4.1. Acoustic–prosodic features

For each of the emotion-annotated student turns in the human–human and human–computer corpora, we computed the 12 acoustic and prosodic features itemized in Fig. 4, for use in the machine learning experiments described in Section 5.20 Motivated by previous studies of emotion prediction in spontaneous dialogues in other domains (Ang et al., 2002; Lee et al., 2001; Batliner et al., 2003), our acoustic–prosodic features represent knowledge of pitch, energy, duration, tempo and pausing. Aspects of silence and pausing have been shown to be relevant for categorizing other aspects of student behavior in tutoring dialogues as well (Fox, 1993; Shah et al., 2002). We focus on acoustic and prosodic features of individual turns that can be computed automatically from the speech signal and are available in real-time to ITSPOKE, since our long-term goal is to use these features to trigger online adaptation in ITSPOKE based on predicted student emotions.

Fig. 4. Twelve acoustic–prosodic features per student turn.

F0 and RMS values, representing measures of pitch excursion and loudness, respectively, were computed using Entropic Research Laboratory's pitch tracker, get_f0,21 with no post-correction. A pitch tracker takes as input a speech file, and outputs a fundamental frequency (f0) contour (the physical correlate of pitch). In Fig. 5, for example, the "Pitch Pane" displays the f0 contour for the experimentally obtained speech file shown in the "Student Speech" pane, where the x-axis represents time and the y-axis represents frequency in Hz.22 Each f0 value corresponds to a frame step of 0.01 s across the student turn "free fall?"; the rising f0 contour is typical of a question. Our features "maxf0" and "minf0" correspond to the highest and lowest f0 values (the peaks and valleys) in the f0 contour, while "meanf0" and "stdf0" are based on averaging over all the (non-zero) f0 values in the contour, which are given by get_f0 in frame steps of 0.01 s.

Energy can alternatively be represented in terms of decibels (dB) or root mean squared amplitude (rms). For example, the "Energy Pane" at the bottom of Fig. 5 displays energy values computed in decibels across frame steps of 0.01 s for the student speech shown in the "Student Speech" pane, where the x-axis represents time and the y-axis represents decibels. The variation in energy values across this student turn reflects that the student's utterance itself is much louder than the "silences" before and after (although as can be seen, the analysis picks up some minor background noise when the student is not speaking). The get_f0 pitch tracker used in this study computes energy as rms values based on a 0.03-s window within frame steps of 0.01 s. "maxrms" and "minrms" correspond to the highest and lowest rms values over all the frames in a student turn, while "meanrms" and "stdrms" are based on averaging over all the rms values in the frames in a student turn.

20 In preliminary experiments for this paper, and also in previous work (e.g., Litman and Forbes, 2003), we also investigated the use of two normalized versions of our acoustic–prosodic features, specifically, features normalized by either prior turn or by first turn. These normalizations have the benefit of removing the gender dependency of f0 features. However, we have consistently found little difference in predictive utility for raw versus normalized features, in both our human–human and human–computer data, so use only raw (non-normalized) feature values here. As discussed below, however, we experiment with the use of gender as an explicit feature.

21 get_f0 and other Entropic software is currently available free of charge at http://www.speech.kth.se/esps/esps.zip.

22 The representations in Figs. 5 and 6 use the Wavesurfer sound visualization and manipulation tool.
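As a concrete illustration of how the eight pitch and energy features above reduce frame-level tracker output to turn-level statistics, here is a minimal sketch. The frame values are invented, and the sketch does not reproduce get_f0 itself; it only shows the aggregation step.

```python
# Minimal sketch: turn-level pitch/energy features from frame-level f0 and RMS
# values (one value per 0.01-s frame). Frame values below are invented examples;
# the actual contours come from the get_f0 pitch tracker, not reproduced here.
import statistics

def pitch_energy_features(f0, rms):
    voiced = [v for v in f0 if v > 0]          # drop unvoiced frames (f0 = 0)
    return {
        "maxf0": max(voiced), "minf0": min(voiced),
        "meanf0": statistics.mean(voiced), "stdf0": statistics.pstdev(voiced),
        "maxrms": max(rms), "minrms": min(rms),
        "meanrms": statistics.mean(rms), "stdrms": statistics.pstdev(rms),
    }

if __name__ == "__main__":
    f0 = [0.0, 0.0, 182.0, 190.5, 201.3, 214.8, 0.0]     # rising contour, e.g. "free fall?"
    rms = [45.0, 50.2, 610.7, 830.4, 795.1, 742.9, 60.3]
    print(pitch_energy_features(f0, rms))
```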

Our four temporal features were computed from the turn boundaries of the transcribed speech. Recall that during our corpora collection, student and tutor speech are digitally recorded separately, yielding a 2-channel speech file for each dialogue. In Fig. 6, the "Tutor Speech" and "Student Speech" panes show a portion of the tutor and student speech files, while the "Tutor Text" and "Student Text" panes show the associated transcriptions. The vertical lines around each tutor and student utterance correspond to the turn segmentations. For example, the leftmost vertical line indicates that the tutor's turn "what is that motion?" begins at approximately 304.5 s (304,500 ms) into the dialogue. Recall that in our human–human dialogues, these turn boundaries are manually labeled by our paid transcriber. In our human–computer dialogues, tutor turn boundaries correspond to the beginning and end times of the speech synthesis process, while student turn boundaries correspond to the beginning and end times of the student speech as detected by the speech recognizer, and thus are a noisy estimate of the actual student turn boundaries.23

Fig. 6. Computing temporal features.

The duration of each student turn was calculated by subtracting the turn's beginning from its ending time. In Fig. 6, for example, the duration of the student's turn is approximately 0.90 s (307.40–306.50 s).

23 While we manually transcribed the lexical information to quantify the error due to speech recognition, we did not manually relabel turn boundaries; we thus cannot quantify the level of noise introduced by automatic turn segmentation.


Fig. 5. Computing pitch and energy-based features.


The preceding pause (prepause) before a student turn began was calculated by subtracting the ending time of the tutor's (prior) turn from the beginning time of the student's turn. In Fig. 6, for example, the duration of the pause preceding the student's turn is 0.85 s (306.50–305.65 s).24

The speaking rate (tempo) was calculated as syllables per second in the turn (where the number of syllables in the transcription was computed using the Festival text-to-speech OALD dictionary, and the turn duration computed as above).25 For example, in Fig. 6, there are five syllables in the student turn ("the freefall motion"), and the duration of the student turn is 0.90 s (as computed above), thus the speaking rate in the turn is 5.56 syllables/second. In this paper, we computed tempo in the human–computer dialogues based on the human transcription of the student turns. Although this more closely reflects the actual tempo rather than the noisier tempo computed on the automatic speech recognition output, Ang et al. (2002) compared machine learning experiments using features such as tempo computed both on the human transcription and on the automatically recognized speech, and found that the prediction results were comparable.

24 Note that in the human–human corpus, if a student turn began before the prior tutor turn ended (i.e., student barge-ins and overlaps), the preceding pause feature for that turn was "0". If a student turn initiated a dialogue or was preceded by a student turn (rather than a tutor turn), its preceding pause feature was not defined for that turn. In the human–computer corpus, every student turn is preceded by a tutor turn.

25 Note that this method calculates only a single (average) tempo of the turn, because we were not sampling the tempo at sub-intervals throughout the turn.

Amount of silence (intsilence) was defined as the percentage of frames in the turn where the probability of voicing = 0; this probability is available from the output of the get_f0 pitch-tracker, and the resulting percentage represents roughly the percentage of time within the turn that the student was silent.26 For example, the student turn in Fig. 6 has approximately 31% internal silence.

26 Using the percentage of unvoiced frames as a measure of silence will overestimate the amount of silence in the turn, because e.g., long unvoiced fricatives will be included; however it has been used in previous work as a rough estimate of internal silence (Litman et al., 2001). In our data, energy was rarely zero across the individual frames per turn, and thus was not a better estimate of internal silence.
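The four temporal features can be reproduced from turn boundary times, a syllable count, and frame-level voicing decisions, as in the short sketch below. The numbers mirror the worked example from Fig. 6; the syllable count is passed in directly, standing in for the Festival OALD dictionary lookup.

```python
# Sketch of the four temporal features (duration, prepause, tempo, intsilence).
# Boundary times and the syllable count follow the Fig. 6 worked example; the
# voicing decisions are invented so that roughly 31% of frames are unvoiced.

def temporal_features(turn_start, turn_end, prior_turn_end, n_syllables, voiced_frames):
    duration = turn_end - turn_start                      # 307.40 - 306.50 = 0.90 s
    prepause = max(turn_start - prior_turn_end, 0.0)      # 306.50 - 305.65 = 0.85 s
    tempo = n_syllables / duration                        # 5 / 0.90 = 5.56 syll/s
    intsilence = voiced_frames.count(False) / len(voiced_frames)   # fraction unvoiced
    return {"duration": duration, "prepause": prepause,
            "tempo": tempo, "intsilence": intsilence}

if __name__ == "__main__":
    frames = [False] * 28 + [True] * 62                   # 28/90 frames unvoiced (~31%)
    print(temporal_features(306.50, 307.40, 305.65, 5, frames))
```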

4.2. Adding "Identifier" features representing the student and problem

Finally, we also recorded for each turn the 3 "identifier" features shown in Fig. 7, all of which are automatically available in ITSPOKE through student login. Prior studies (Oudeyer, 2002; Lee et al., 2002) have shown that "subject" and "gender" features can play an important role in emotion recognition, because different genders and different speakers can convey emotions differently. "subject ID" and "problem ID" are uniquely important in our tutoring domain, because in contrast to e.g., call centers, where most callers are distinct, students will use our system repeatedly, and problems are repeated across students.27

Fig. 7. Three identifier features per student turn.

27 In preliminary experiments for this paper, we examined the use of only "subject ID" as well as only "subject ID" and "gender", with the view that these identifier features generalized to other domains besides physics. Overall we found that these two subsets produced results that were the same as including the "problem ID" feature.

5. Predicting student emotion from acoustic–prosodic features

We next performed machine learning experiments with acoustic–prosodic features and our emotion-annotated student turns, to explore how well the 12 acoustic–prosodic features discussed in Section 4 predict the emotion labels in both our human–human and human–computer tutoring corpora. We explore the predictions for our originally agreed emotion labels and our consensus emotion labels (Section 3.2). Using originally agreed data is expected to produce better results, since presumably annotators originally agreed on cases that provide more "clear-cut" prosodic information about emotional features (Ang et al., 2002), but using consensus data is worthwhile because it includes the less clear-cut data that the computer will actually encounter.

For these experiments, we use a boosting algorithm in the Weka machine learning software (Witten and Frank, 1999). In general, the "boosting" algorithm, called "AdaBoostM1" in Weka, enables the accuracy of a "weak" learning algorithm to be improved by repeatedly applying that algorithm to different distributions or weightings of training examples, each time generating a new weak prediction rule, and eventually combining all weak prediction rules into a single prediction (Freund and Schapire, 1996). We found in prior studies of our human–human data (Litman and Forbes, 2003) that boosted decision trees consistently yielded more robust performance compared to several other learning methods, thus we continue their use here. For comparison, we use a standard baseline algorithm, called "ZeroR" in Weka; this algorithm simply predicts the majority class ("neutral") in the training data, and thus is used as a performance benchmark for the boosting algorithm.
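The study used Weka's AdaBoostM1 over decision trees, with ZeroR as the majority-class benchmark; the sketch below is only a loose analogue in Python/scikit-learn (not the authors' setup), using randomly generated placeholder features and labels, to show the boosted-learner-versus-majority-baseline comparison under 10-fold cross-validation.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 12))                       # placeholder: 12 acoustic-prosodic features
y = rng.choice(["negative", "neutral", "positive"],  # placeholder emotion labels
               size=340, p=[0.3, 0.5, 0.2])

# Boosting over weak learners (scikit-learn's default base learner is a shallow
# decision tree); analogous in spirit to Weka's AdaBoostM1 over decision trees.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)

# Majority-class baseline, analogous to Weka's ZeroR.
baseline = DummyClassifier(strategy="most_frequent")

print(cross_val_score(booster, X, y, cv=10).mean())
print(cross_val_score(baseline, X, y, cv=10).mean())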

For the machine learning experiments discussed in this section, we created 26 different feature sets from the features listed in Figs. 4 and 7, to study the effects that acoustic–prosodic features had on predicting student emotion, either with or without our supplementary "identifier" features. First, we created an "allspeech" feature set, containing all 12 of the acoustic–prosodic features in Fig. 4. The "allspeech" feature set explores how well our emotion labels are predicted from (potentially) all of our 12 acoustic–prosodic features in combination. Note that while use of this feature set makes all 12 acoustic–prosodic features available to the learning algorithm, the algorithm itself incorporates an internal feature selection process; as will be illustrated when discussing feature usage, the learned classification models sometimes contain only subsets of the acoustic–prosodic features. Next, we created 12 individual feature sets, one for each feature in Fig. 4; these sets provide individual performance baselines for the "allspeech" feature set, by exploring how well each individual acoustic–prosodic feature alone predicts the emotion labels in our two corpora. While the predictive utility of a single feature in isolation does not necessarily characterize its contribution when other features are available, following Batliner et al. (2000), we include these baselines because in addition to trying to optimize classification by incorporating multiple features, we are also interested in understanding how the predictive utility of different types of acoustic–prosodic information generalizes across different types of data (acted versus read versus wizard-of-oz speech data in (Batliner et al., 2000), and human–human versus human–computer spoken dialogues in our work). From a system building perspective, examining the performance of single features also allows us to understand potential tradeoffs between monitoring more information (which can increase time and/or memory overhead) and predictive accuracy. Finally, for each of these 13 feature sets we created a corresponding "+id" counterpart, which adds the three "identifier" features in Fig. 7. These 13 "+id" feature sets study how much additional predictive ability the identifier features supply to the 13 acoustic–prosodic feature sets. Again, from a practical perspective, examining the role of identifiers allows us to understand potential tradeoffs between retraining emotion prediction models whenever a new user and/or physics problem is added, and predictive accuracy.
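For concreteness, the 26 feature sets can be enumerated as in the sketch below (Python; the feature and identifier names are taken from the figures and tables in this paper, but the snake_case spellings and the dictionary layout are our own illustration, not the authors' code).

# 12 acoustic-prosodic features (4 pitch, 4 energy, 4 temporal), per Fig. 4.
PITCH    = ["minf0", "maxf0", "meanf0", "stdf0"]
ENERGY   = ["minrms", "maxrms", "meanrms", "stdrms"]
TEMPORAL = ["duration", "tempo", "prepause", "intsilence"]
SPEECH   = PITCH + ENERGY + TEMPORAL

# 3 identifier features, per Fig. 7 (illustrative names).
IDENTIFIERS = ["subject_id", "gender", "problem_id"]

feature_sets = {"allspeech": list(SPEECH)}
feature_sets.update({f: [f] for f in SPEECH})  # 12 single-feature sets

# A "+id" counterpart for each of the 13 sets above: 26 sets in total.
feature_sets.update({name + "+id": feats + IDENTIFIERS
                     for name, feats in list(feature_sets.items())})

assert len(feature_sets) == 26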

For each feature set examined in our experiments, we report the mean predictive accuracy (%Correct), the interval which when added to the mean accuracy produces the 95% confidence interval (95% CI), and the relative improvement in error reduction (%Rel.Imp.) over our standard baseline algorithm (which always predicts the majority class, "neutral"). Accuracies are computed across 10 runs of 10-fold cross-validation, where for each cross-validation, the training and test data are drawn from utterances produced by the same set of speakers. (Since the goal of the system is to repeat users over the course of a semester, we expect that after a few dialogues with a given user we would be able to train the emotion predictor for that user.) (Footnote 28) Significances of accuracy differences between each feature set and the majority class baseline are automatically computed in Weka using a two-tailed t-test (p < .05). We then computed standard error from the standard deviation (which is also automatically computed in Weka) as SE = std(x)/sqrt(n), where n = 10 because there are 10 runs. From the standard error, we computed the 95% confidence interval to determine whether the accuracy differences between any two feature sets are statistically significant. For any feature set, the 95% confidence interval is the mean accuracy +/- 3.3 * SE (Footnote 29). If the confidence intervals for two feature sets are non-overlapping, then their mean accuracies are significantly different with 95% confidence. Finally, relative improvement over the baseline error for any feature set x is computed as %Rel.Imp. = (error(baseline) - error(x)) / error(baseline), where error(x) is 100 minus the %Correct(x) value (Footnote 30).

Footnote 28: In a separate experiment we found that testing on one speaker and training on the others, averaged across all speakers, does not significantly change the results. In particular, we compared the results on 1 run of 10-fold cross-validation using our best-performing feature set from Forbes-Riley and Litman (2004), and found that training and testing on the same set of speakers improved the accuracy by only 3% as compared to the average result of testing on one speaker and training on the others.

Footnote 29: We applied the Bonferroni Correction (a statistical adjustment for comparing multiple feature sets applied to the same dataset) to compare our 30 human–human feature sets and our 34 human–computer feature sets, which yielded this t-value of 3.3 (at α = .001).

Footnote 30: We report our results as integers because reporting the fractions produced by 10-fold cross-validation creates the incorrect impression that increasing or decreasing the least significant digit by 1 (e.g., going from 48.3% to 48.4%) yields a difference in how a single case is classified. For example, in order to report accuracies to a tenth of a percent, we would need a dataset containing at least 1000 cases.

Table 5
%Correct (10X10cv); human–human agreed; MAJ (neutral) = 53%

Feature set    %Correct    95% CI    %Rel.Imp.
allspeech      73          +/- 2     42
duration       73          +/- 1     42
prepause       64          +/- 1     23
minrms         61          +/- 1     16
tempo          60          +/- 1     14
stdrms         58          +/- 1     10
stdf0          57          +/- 1     9
minf0          57          +/- 1     8
maxrms         55          +/- 1     3

Table 6
%Correct (10X10cv); human–human agreed; MAJ (neutral) = 53%

Feature set     %Correct    95% CI    %Rel.Imp.
allspeech+id    74          +/- 1     45
duration+id     68          +/- 1     32
tempo+id        65*         +/- 1     26
maxrms+id       64*         +/- 1     24
minrms+id       64*         +/- 1     22
stdrms+id       63*         +/- 1     21
prepause+id     61*         +/- 1     16
minf0+id        61*         +/- 1     16
stdf0+id        60*         +/- 1     15
maxf0+id        56*         +/- 1     6
meanrms+id      55*         +/- 1     5
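As a worked illustration of the statistics just described (a sketch only; the per-run accuracies below are invented, not taken from the tables), the following Python code computes the standard error, the 3.3 * SE confidence half-width, and the relative improvement over a majority baseline.

import numpy as np

def summarize(run_accuracies, baseline_accuracy, t_value=3.3):
    # Mean accuracy, 95% CI half-width (t_value * SE), and %Rel.Imp. over the baseline.
    acc = np.asarray(run_accuracies, dtype=float)
    se = acc.std() / np.sqrt(len(acc))        # SE = std(x) / sqrt(n), n = number of runs
    ci = t_value * se                         # 95% CI: mean accuracy +/- 3.3 * SE
    err_baseline = 100.0 - baseline_accuracy  # error(x) = 100 - %Correct(x)
    err_x = 100.0 - acc.mean()
    rel_imp = 100.0 * (err_baseline - err_x) / err_baseline
    return acc.mean(), ci, rel_imp

# Invented per-run accuracies (10 runs of 10-fold cross-validation) and a 53% baseline.
runs = [72.5, 73.1, 73.8, 72.9, 73.4, 72.7, 73.6, 73.0, 72.8, 73.2]
print(summarize(runs, baseline_accuracy=53.0))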


5.1. Predicting student emotions in the human–human tutoring corpus

5.1.1. Predicting agreed human–human emotion labels

Our first machine-learning experiment explores how well each of our 26 feature sets predicts the 340 emotion-labeled student turns in our human–human spoken tutoring corpus that both annotators originally agreed on, as discussed in Section 3.2.

Tables 5 and 6 present our results for those acoustic–prosodic feature sets (without and with identifier features, respectively) that significantly outperformed the baseline. The accuracy of the baseline algorithm (MAJ) in our first experiment is 53%, as shown in the table caption. Hereafter, a "*" in the "+id" tables indicates that adding identifier features significantly improved the accuracy of the same acoustic–prosodic feature set without identifiers, as reported in the corresponding "-id" table.

In Table 5, "allspeech", which contains all 12 acoustic–prosodic features, is one of the two best-performing feature sets; it significantly outperforms almost all the individual acoustic–prosodic feature sets, and has a relative improvement of 42% over the baseline error. The "duration" feature, which performs statistically the same as the "allspeech" feature set, also significantly outperforms all other individual acoustic–prosodic feature sets. The remaining individual feature sets that are shown as significantly outperforming the majority class baseline have relative improvements ranging from 3% to 23%, with the temporal features in general providing the most predictive utility. Those four individual feature sets that aren't shown ("intsilence", "maxf0", "meanf0" and "meanrms") perform statistically the same or slightly worse than the baseline.

As shown in Table 6, adding identifier features to our 13 acoustic–prosodic feature sets significantly improves their performance in most cases. "allspeech+id", which is still the best-performing feature set, now significantly outperforms all the individual acoustic–prosodic feature sets (including duration). Within the individual feature sets, relative improvements now range from 5% to 32%, with a different relative utility of the feature sets. For example, "maxrms+id" significantly outperforms "prepause+id", while without the identifiers the reverse is the case. Furthermore, two of the four individual feature sets ("maxf0", "meanrms") that previously performed statistically the same or worse than the baseline now perform statistically better than the baseline (although they yield the lowest accuracies in the table). However, the performance of two of the individual temporal feature sets ("duration" and "prepause") has decreased with the addition of identifier features, suggesting overfitting and/or an insufficient amount of data.


Overall, these results suggest that many acoustic–prosodic features, alone or in combination with each other or with identifier features, are useful predictors for student emotion, for the "Agreed" emotion labels in our human–human spoken tutoring dialogue corpus. Many feature sets outperform our baseline, with combining acoustic–prosodic features yielding the best performance. When the features are used alone, we see a trend that temporal features provide the most predictive utility, followed by energy features, and lastly by pitch features; these observations hold both with and without identifiers. Furthermore, the addition of the identifiers seems to particularly improve the performance of the energy and pitch features.

5.1.2. Predicting consensus human–human emotion labels

Our second machine learning experiment explores how well each of our 26 feature sets predicts all 453 emotion-labeled student turns in our human–human spoken tutoring corpus, including the 113 student turns that both annotators originally disagreed on and subsequently achieved consensus on, as discussed in Section 3.2.

Tables 7 and 8 present our results for those acoustic–prosodic feature sets (without and with identifier features, respectively) that significantly outperformed the baseline. The accuracy of the baseline algorithm (MAJ) in our second experiment is 60%, as shown in the table caption.

Table 7
%Correct (10X10cv); human–human consensus; MAJ (neutral) = 60%

Feature set    %Correct    95% CI    %Rel.Imp.
allspeech      66          +/- 2     15
tempo          64          +/- 0     8
prepause       63          +/- 1     8
duration       63          +/- 0     7

Table 8
%Correct (10X10cv); human–human consensus; MAJ (neutral) = 60%

Feature set     %Correct    95% CI    %Rel.Imp.
allspeech+id    66          +/- 2     16
tempo+id        67*         +/- 1     18
stdrms+id       65*         +/- 1     12
prepause+id     65          +/- 1     11
minrms+id       63*         +/- 1     8
minf0+id        63*         +/- 1     6
duration+id     63          +/- 1     6
maxf0+id        63*         +/- 0     6
maxrms+id       62*         +/- 1     5

A comparison of Tables 5–8 shows that overall, using consensus-labeled data decreased the performance across many feature sets, both with and without identifier features. Although "allspeech" is still the best-performing feature set, its predictive utility is significantly worse for the consensus data than for the originally agreed data, with respect to both absolute accuracy and relative improvement. It now significantly outperforms the "duration" feature alone, but performs statistically the same as most of the individual features shown in Tables 7 and 8, i.e. both with and without identifiers. The "duration" feature no longer significantly outperforms all other individual features, either with or without identifier features.

Moreover, it is no longer the case that most individual feature sets perform statistically better than the baseline; this is now true of only three of the twelve individual feature sets without identifier features, where relative improvements range from 7% to 8%. Note that as with the agreed data, the temporal features again provide the best predictive utility. As shown in Table 8, with identifier features, eight of the individual feature sets perform statistically better than the baseline, and relative improvements range from 5% to 18%. As with the agreed data, these feature sets are a superset of those found to be of predictive utility without identifiers, and adding identifier features does not improve the performance of "allspeech" or "duration". Both with and without identifiers, the range of relative improvement is much less than the corresponding range on the agreed data.

Taken together, the results of these two experiments suggest that while acoustic–prosodic features (with and without identifiers) can still significantly outperform a baseline classifier for predicting consensus-labeled student emotions in our human–human spoken tutoring dialogue corpus, they are quantitatively much more effective at predicting our originally agreed emotion labels. This may be due in part to the fact that 82% (92/113) of the added consensus labels were "neutral", which increases the accuracy of the majority class baseline, and "dilutes" the accuracy of the non-neutral predictions. Qualitatively, many of the same observations hold across experiments on the agreed and consensus-labeled data (e.g., the utility of feature combination and of individual temporal features, the utility of identifiers for the pitch and energy features).

5.1.3. Feature usage in human–human corpus predictions

Our third machine learning experiment is intended to give us an intuition about how acoustic–prosodic features in combination are used to predict emotion classes in our human–human tutoring dialogues. As discussed above, our machine learning algorithm "boosts" a decision tree algorithm. Because the output of the boosting algorithm includes many decision trees which are then weighted and combined, for this section, we reran the decision tree algorithm without boosting on the "allspeech" feature set, to create a single decision tree that we could more easily analyze. For this analysis and those below regarding feature usage, we restrict our discussion to our agreed data (representing the clearest cases of emotion expression), and to the "-id" versions of our feature sets (representing features that were also often used in other studies).

Table 9 shows the feature usages of the 12 acoustic–prosodic features from the "allspeech" feature set, based on the structure of the learned decision tree. Following Ang et al. (2002), feature usage is reported as the percentage of decisions for which the feature type is queried; thus features higher in the tree have higher usage than features lower in the tree. As shown, the temporal features are the most highly queried features, with duration features queried most, followed by the four pitch features, which are queried roughly equally. Energy features are rarely queried in this tree.
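This usage statistic can be approximated from a single learned tree as sketched below (Python/scikit-learn rather than the Weka trees actually used in the study; weighting each internal node's query by the number of training instances that reach it is our own approximation of "the percentage of decisions for which the feature is queried", and the data are randomly generated placeholders).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def feature_usage(tree_clf, feature_names):
    # Fraction of (instance-weighted) decision-node queries per feature.
    tree = tree_clf.tree_
    is_internal = tree.children_left != -1      # leaf nodes have child index -1
    queried = tree.feature[is_internal]         # feature index tested at each internal node
    weights = tree.n_node_samples[is_internal]  # training instances reaching that node
    totals = np.zeros(len(feature_names), dtype=float)
    for f, w in zip(queried, weights):
        totals[f] += w
    return dict(zip(feature_names, totals / totals.sum()))

# Hypothetical data: X holds the 12 acoustic-prosodic features, y the labels.
rng = np.random.default_rng(1)
names = ["minf0", "maxf0", "meanf0", "stdf0", "minrms", "maxrms", "meanrms",
         "stdrms", "duration", "tempo", "prepause", "intsilence"]
X = rng.normal(size=(340, 12))
y = rng.choice(["negative", "neutral", "positive"], size=340)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(feature_usage(clf, names))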

Table 9
Feature usage for "allspeech", human–human agreed

Pitch     19%    Energy     5%    Temporal      76%
minf0      6%    maxrms     3%    duration      40%
meanf0     6%    meanrms    2%    tempo         17%
maxf0      4%    minrms     0%    prepause      15%
stdf0      3%    stdrms     0%    intsilence     4%

From an examination of the paths through the decision trees, we can make a number of generalizations about how acoustic–prosodic features are used to predict emotion classes in our human–human corpus. Student turns with longer durations and longer preceding pauses are predicted negative, while those with shorter durations are predicted positive if they have short preceding pauses, and are predicted neutral otherwise. Student turns with lower values across pitch features are generally predicted positive, while student turns with higher values across pitch features are generally predicted negative or neutral.

This analysis supports the results of the first two experiments, which imply that in the absence of identifier features, temporal features are among the most important features used to decide the emotion classification of a turn. Pitch and energy features are less highly queried and less predictive on their own. Interestingly, while Table 5 suggests that energy features have greater predictive utility than pitch features in isolation, Table 9 suggests a different pattern when all features are used in combination.

5.2. Predicting student emotions in the human–computer tutoring corpus

5.2.1. Predicting agreed human–computer emotion labels

Our fourth machine-learning experiment explores how well each of our 26 feature sets predicts the 202 emotion-labeled student turns in our human–computer spoken tutoring corpus that both annotators originally agreed on, as discussed in Section 3.2.

Tables 10 and 11 present our results for those acoustic–prosodic feature sets (without and with identifier features, respectively) that significantly outperformed the baseline. The accuracy of the baseline algorithm (MAJ) in our fourth experiment is 47%, as shown in the table caption.

Table 10
%Correct (10X10cv); human–computer agreed; MAJ (neutral) = 47%

Feature set    %Correct    95% CI    %Rel.Imp.
allspeech      55          +/- 3     17
duration       53          +/- 1     13
maxrms         49          +/- 0     4
minf0          48          +/- 1     3
stdrms         48          +/- 1     2

Table 11
%Correct (10X10cv); human–computer agreed; MAJ (neutral) = 47%

Feature set      %Correct    95% CI    %Rel.Imp.
allspeech+id     62*         +/- 3     29
meanf0+id        64*         +/- 2     33
maxrms+id        64*         +/- 1     33
duration+id      64*         +/- 1     32
stdrms+id        63*         +/- 1     31
intsilence+id    62*         +/- 2     29
prepause+id      62*         +/- 2     29
maxf0+id         62*         +/- 2     28
tempo+id         61*         +/- 3     28
minf0+id         61*         +/- 2     28
stdf0+id         61*         +/- 2     28
meanrms+id       61*         +/- 1     27
minrms+id        60*         +/- 2     26

In Table 10 "allspeech" is the best-performing feature set; it significantly outperforms almost all the individual acoustic–prosodic feature sets, and has a relative improvement of 17% over the baseline error. The "duration" feature significantly outperforms all other individual acoustic–prosodic features, and performs statistically the same as "allspeech". This is the same result that was found on the human–human agreed data. From a systems-building perspective, the relative utility of this simple model is also a useful finding: duration is already available in the ITSPOKE logs, while monitoring all of the features used by "allspeech" would require the addition of new system modules. Note however that in contrast to the human–human agreed data, it is not the case that most individual feature sets significantly outperform the majority class baseline; only four individual feature sets outperform the baseline, with relative improvements ranging from 2% to 13%. These same four feature sets also outperformed the baseline in the human–human data; however, duration is now the only temporal feature to outperform the baseline in the human–computer data. Interestingly, in (Lee et al., 2001), a different relative ranking of features is found, with respect to their usefulness as predictors of negative/non-negative emotions in a corpus of human–computer dialogues from a call center application. There, mean energy is the best predictor in isolation, followed by maximum energy, energy range, and median pitch. Although temporal features were not used in that study, the different ranking of energy and pitch features indicates that to some extent, emotion prediction in naturally occurring dialogues might be a domain-specific task. Finally, there is generally a decrease in both relative and absolute performance across feature sets on the human–computer data as compared to the human–human data, although baseline performance also decreases in the human–computer data.

As shown in Table 11, adding identifier features to our 13 acoustic–prosodic feature sets significantly improves their performance across the board; we see that all feature sets now perform statistically better than the baseline, and that all of the feature sets in Table 10 perform significantly better with the addition of identifiers. However, "allspeech+id", with a relative improvement of 29% over the baseline error, is no longer the best-performing feature set, and it performs statistically the same as all of the individual feature sets. "meanf0+id", "maxrms+id" and "duration+id" are now the best-performing feature sets, although statistically they perform the same as most of the other feature sets. In fact, for the individual feature sets with identifiers, average absolute performance (%Correct) is now on par with that in the human–human corpus, and relative improvements are somewhat higher, ranging from 26% to 33%. As with the human–human corpus, the relative utility of the individual features also changes with and without identifiers.

In sum, for the "Agreed" emotion labels in our human–computer spoken tutoring dialogue corpus, our results suggest that acoustic–prosodic features, primarily in combination with identifier features, are the most useful predictors for student emotion. For the same experiment on the human–human data, the addition of the identifier features did not significantly improve the performance of the best feature sets. Also, the best results in our human–computer corpus are worse than in our human–human corpus, with both lower raw accuracies and lower relative improvements. This, along with the lower baseline in the human–computer corpus, suggests that the human–computer corpus presents a more challenging task. In fact, as was shown in the confusion matrices for our emotion annotations (Tables 1 and 2), in addition to getting lower best predictive accuracies in the human–computer corpus as compared to the human–human corpus, we also found the student turns in the human–computer corpus more difficult to annotate for emotion. We hypothesize that this is in part because student turns are much shorter in the human–computer corpus. This means that there is less acoustic–prosodic information in the human–computer student turns to make use of when predicting emotions, and may account for the lower annotation accuracy as well, as discussed in Section 3. The lower results in our human–computer corpus may also reflect the fact that our acoustic–prosodic features are extracted based on turn boundaries produced automatically by the speech recognizer, which introduces noise into the acoustic–prosodic feature values. In contrast, the manually annotated turn boundaries in our human–human corpus yield more accurate acoustic–prosodic feature values.

5.2.2. Predicting consensus human–computer emotion labels

Our fifth machine learning experiment explores how well each of our 26 feature sets predicts the 333 emotion-labeled student turns in our human–computer spoken tutoring corpus, including the 131 student turns that both annotators originally disagreed on and subsequently achieved consensus on (Section 3.2).

Tables 12 and 13 present our results for those acoustic–prosodic feature sets (without and with identifier features, respectively) that significantly outperformed the baseline. For comparison with other tables, "allspeech" is also shown, although it performed statistically the same as the baseline. The accuracy of the baseline algorithm (MAJ) in this fifth experiment is 48%, as shown in the table caption.

Table 12
%Correct (10X10cv); human–computer consensus; MAJ (neutral) = 48%

Feature set    %Correct    95% CI    %Rel.Imp.
allspeech      49          +/- 2     –
duration       52          +/- 1     6

Table 13
%Correct (10X10cv); human–computer consensus; MAJ (neutral) = 48%

Feature set      %Correct    95% CI    %Rel.Imp.
allspeech+id     52*         +/- 1     7
duration+id      57*         +/- 1     16
tempo+id         57*         +/- 1     16
stdrms+id        57*         +/- 1     16
meanf0+id        56*         +/- 1     16
prepause+id      56*         +/- 1     16
stdf0+id         56*         +/- 1     15
maxf0+id         56*         +/- 1     15
maxrms+id        56*         +/- 2     15
minrms+id        56*         +/- 2     15
minf0+id         55*         +/- 1     13
intsilence+id    54*         +/- 2     12
meanrms+id       54*         +/- 2     11

A comparison of Tables 10–13 shows that overall, using consensus-labeled data decreased performance across feature sets. This was also found in our human–human tutoring data, and this was also found in (Ang et al., 2002), where both agreed and consensus data were used when predicting emotion in a corpus of human–computer dialogues regarding travel arrangements. In fact, without identifier features, only one feature set ("duration") significantly outperforms the baseline, with a relative improvement of 6%; duration was also found to be a useful predictor in our comparable human–human corpus experiment. With identifier features, all 13 feature sets significantly outperform the baseline, with relative improvements ranging from 11% to 16%, but this is a lesser degree of improvement than found when using agreed data.

Taken together, the results of these two experiments suggest that while acoustic–prosodic features and identifier features are useful predictors in combination for student emotion in our human–computer spoken tutoring dialogue corpus, the originally agreed emotion labels provide a more accurate testbed for predictions. This was also found in our human–human data. Again, this may be due in part to the fact that 50% (67/131) of the added consensus labels were "neutral", which increases the accuracy of the majority class baseline, and "dilutes" the accuracy of the non-neutral predictions. Contrary to the results of our human–human experiments, combining acoustic–prosodic features did not improve performance. However, even more than in our human–human experiments, the addition of identifiers improved performance. In particular, for every acoustic–prosodic feature set, the version with identifiers outperformed both the baseline and its counterpart without identifiers.

5.2.3. Feature usage in human–computer corpus predictions

Our sixth machine learning experiment is intended to give us an intuition about how acoustic–prosodic features are used to predict emotion classes in our human–computer tutoring dialogues. As above, we ran the decision tree algorithm without boosting, using the "allspeech" feature set on our agreed human–computer emotion-labeled data.

Table 14 shows the feature usages of the 12 acoustic–prosodic features from the "allspeech" feature set, based on the structure of the decision tree. As shown, temporal features overall are the most highly queried, with each individual feature queried roughly equally. Within the pitch and energy features, mean pitch and maximum energy are queried the most, roughly equally to the individual temporal features; the remaining pitch and energy features are rarely queried. Note that energy features are queried much more, and duration much less, than in the human–human agreed data. Interestingly, Ang et al. (2002) also found that temporal features were most frequently queried, when predicting emotion in a similar experiment with a corpus of human–computer dialogues regarding travel arrangements. However, in that study, duration and tempo were most frequently queried, and the amount of internal silence was not among the most frequently queried temporal features as is the case here. Moreover, in that study, maximum pitch was the most frequently queried pitch feature, while here mean pitch is the most frequently queried pitch feature. This comparison shows that while generalizations about the features useful to emotion prediction can be made across naturally occurring corpora, to some extent the specific usefulness of emotion predictors will be domain specific.

Table 14
Feature usage for "allspeech", human–computer agreed

Pitch     20%    Energy     24%    Temporal      56%
meanf0    14%    maxrms     14%    intsilence    16%
stdf0      3%    meanrms     6%    tempo         14%
minf0      2%    minrms      4%    duration      13%
maxf0      1%    stdrms      0%    prepause      13%

From an examination of the paths through the decision trees, we can make a number of generalizations about how acoustic–prosodic features are used to predict emotion classes in our human–computer spoken tutoring corpus. Student turns with higher mean pitch are predicted negative. Student turns with longer durations and lower minimum energies are also predicted negative, while student turns with longer durations and higher maximum energies are predicted positive. In (Ang et al., 2002), it was also found that longer durations and larger overall pitch features were associated with frustration (which would be a subclass of negative in our emotion scheme), which is generally consistent with our results. However, we do not consistently find that slower tempos are associated with negative emotions, which is found in (Ang et al., 2002).

6. Adding a feature representing the student turn transcription

While Section 5 shows that information in the speech signal is one important source of information for accurately modeling student emotion during tutoring dialogues, speech-based features only address how something is said. What a student says is also important; the addition of features representing such information has already been shown to be useful for speech-based emotion prediction from naturally occurring spoken dialogue data in other domains (Narayanan, 2002; Batliner et al., 2003; Devillers et al., 2003; Shafran et al., 2003; Lee et al., 2002; Ang et al., 2002; Litman et al., 2001; Batliner et al., 2000). As a first step towards investigating whether such improvements will also occur in spoken tutoring dialogues, we added to our speech-based feature sets a new string-valued feature representing the transcription of each student turn (Footnote 31). For the human–computer corpus, we used both a human transcription and the speech recognizer's transcription. As in many of the prior studies (Narayanan, 2002; Lee et al., 2002; Ang et al., 2002; Batliner et al., 2000; Batliner et al., 2003), we also examined the relative utility of using lexical features instead of speech-based features. As in Section 5, besides trying to optimize performance by combining features, we are interested in the comparative analyses to better understand whether the relative utility of different knowledge sources generalizes across our human–human and human–computer corpora, and to understand tradeoffs between predictive accuracy and more system-oriented concerns (e.g., lexical information is already logged in a typical dialogue system, while many of our acoustic–prosodic features are not).

Footnote 31: Weka provides facilities for transforming string-valued features containing arbitrary textual values into word vectors (by creating one feature for each word which encodes the presence or absence of that word within the string). Thus, this new lexical feature set represents the transcribed lexical items in each student turn as a bag of words, or more specifically, as a word occurrence vector, indicating the lexical items that are present in the turn.
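As an illustration of the word-vector transformation described in Footnote 31, the sketch below (Python/scikit-learn rather than Weka, with invented example turns) maps each transcription to a binary vector that records which lexical items are present; such a vector plays the role of the "text" feature set, and can be appended to the acoustic–prosodic and identifier features for the combined sets.

from sklearn.feature_extraction.text import CountVectorizer

# Invented example transcriptions of student turns.
turns = [
    "um the acceleration will be constant",
    "yeah right ok",
    "uh I don't know",
]

# binary=True encodes presence/absence of each word rather than its count,
# mirroring the word-occurrence vectors described in Footnote 31.
vectorizer = CountVectorizer(binary=True, token_pattern=r"[^\s]+")
text_features = vectorizer.fit_transform(turns)

print(vectorizer.get_feature_names_out())
print(text_features.toarray())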

6.1. Predicting human–human emotion labels with lexical information

In our human–human corpus, we created our new lexical feature representing the turn's transcription, "text", and a corresponding feature set, "text+id", that also includes our three identifier features. We then performed a series of machine learning experiments with these lexical feature sets, alone and in combination with our 26 acoustic–prosodic feature sets, to examine whether or not the addition of lexical information improves the performance of acoustic–prosodic features, with or without identifier features.

Table 15
%Correct (10X10cv); human–human agreed; MAJ (neutral) = 53%

Feature set          %Correct    95% CI    %Rel.Imp.
allspeech            73          +/- 2     42
allspeech+id         74          +/- 1     45
text                 79          +/- 1     54
text+id              77          +/- 1     51
text+allspeech       79          +/- 1     57
text+allspeech+id    78          +/- 1     53

Table 16
%Correct (10X10cv); human–human consensus; MAJ (neutral) = 60%

Feature set          %Correct    95% CI    %Rel.Imp.
allspeech            66          +/- 2     15
allspeech+id         66          +/- 2     16
text                 70          +/- 1     24
text+id              69          +/- 1     23
text+allspeech       71          +/- 1     27
text+allspeech+id    71          +/- 1     27

Table 15 presents the best results of these experiments when run on the 340 emotion-labeled student turns in our human–human spoken tutoring corpus that both annotators originally agreed on. For comparison, our best-performing feature sets from the acoustic–prosodic feature experiments on this agreed data, "allspeech (+/-id)" (Section 5.1.1), are repeated from Tables 5 and 6.

As shown, the "text (+/-id)" and "text+allspeech (+/-id)" feature sets perform significantly better than the "allspeech (+/-id)" feature sets, and the "text+allspeech+id" feature set performs marginally better than the "text+id" feature set, although with or without identifier features, these best-performing feature sets perform statistically the same. All of these new "text"-inclusive feature sets performed significantly better than the baseline (53%), with the highest relative improvements yet seen, ranging from 51% to 57%. Although not shown, all of the 24 individual acoustic–prosodic feature sets also significantly outperformed the baseline when combined with the "text" feature set (with or without "id" features); with accuracies ranging from 76% to 79%, many of these 24 feature sets performed statistically the same as the "text" and "text+allspeech" feature sets.

Interestingly, and as exemplified in Table 15, there was little or no statistical difference between the "+/-id" feature sets across the 28 "text"-inclusive experiments; in most cases the "+id" sets performed marginally worse. This implies that the use of lexical items replaces some of the increased predictive power introduced to the speech sets by the identifier features.

Table 16 presents the best results of these experiments run on the 453 emotion-labeled student turns in our human–human spoken tutoring corpus that both annotators achieved consensus on. For comparison, our best-performing (or statistically comparable) feature sets from the acoustic–prosodic feature experiments on this consensus data, "allspeech (+/-id)" (Section 5.1.2), are repeated from Tables 7 and 8.

A comparison of Tables 15 and 16 shows that overall, using consensus-labeled data decreased the performance across feature sets, both with and without identifier features. This was also found across the experiments using only acoustic–prosodic features. Again, however, we see that the "text (+/-id)" and the "text+allspeech (+/-id)" feature sets outperform the "allspeech (+/-id)" feature sets, and all but "text+id" do so significantly. We also see that now both of the "text+allspeech (+/-id)" feature sets perform marginally better than the "text (+/-id)" feature sets, although again these best-performing feature sets perform statistically the same. Again all the new "text"-inclusive feature sets performed significantly better than the baseline (60%), although relative improvements are lower than those seen using the originally agreed data. Although not shown, all of the individual acoustic–prosodic feature sets again significantly outperformed the baseline when combined with the "text" feature set (with or without "id" features), with accuracies ranging from 67% to 70%. Now, however, most of these 24 feature sets performed statistically worse than the "text+allspeech" feature sets. Again though, as exemplified in Table 15, there was little or no statistical difference between the "+/-id" feature sets in these "text"-inclusive experiments, and in most cases the "+id" set performed marginally worse.

In sum, the experiments on the emotion-annotated data in our human–human corpus clearly demonstrate the predictive utility of using even a simple bag-of-words representation of what the student said. We further see some increased predictive ability when combining lexical features with acoustic–prosodic features, either with or without identifier features. However, within the combined sets, we see a marginal decrease in predictive ability when identifier features are added.

6.1.1. Feature usage in human–human corpus predictions using text

To give us an intuition about how lexical features, alone and in combination with acoustic–prosodic features, are used to predict emotion classes in our human–human tutoring dialogues, we ran the decision tree algorithm without boosting, for the "text" and "text+allspeech" feature sets, for our human–human emotion-labeled agreed data.

Table 17 shows the feature usages for the lexical items whose presence or absence was queried when predicting the emotion labels using the "text" feature set; the student utterances contained a total of 379 lexical items, but only 21 of them were queried in this decision tree. As shown, among the lexical items queried are numerous instances of grounding-type phrases (e.g., "right", "yeah", "ok"), as well as hedging type phrases and filled pauses ("um", "uh", "well"). In (Bhatt et al., 2004), it is also found that these types of phrases are used to convey emotion and attitude in a corpus of human–human medical tutoring dialogues. Inspection of this decision tree indicates that it is difficult to generalize over the paths about how specific lexical items are used to predict the emotion classes. For example, student turns containing "right" but not "ok" are generally predicted to be positive, while student turns containing "yeah" and "um" but not "right" are generally predicted to be negative.

Table 17
Feature usage for "text", human–human agreed

right    12%    well            5%    it's            2%
yeah     11%    huh             5%    the             2%
is       10%    when            4%    it'll           1%
um        9%    acceleration    4%    horizontally    1%
that      8%    will            3%    don't           1%
I         7%    uh              3%    one             1%
what      6%    direction       2%    ok              1%

Table 18
Feature usage for "text+allspeech", human–human agreed

Lexical items: 74%
which    10%    will     5%    huh        1%
about     9%    one      4%    alright    1%
don't     8%    right    4%    I          1%
ok        8%    yeah     3%    the        1%
equal     7%    be       2%    uh         1%
but       6%    um       2%    that       1%

Temporal: 21%
duration    17%    tempo    3%    prepause    1%

Energy: 3%
stdrms    1%    minrms    1%    maxrms    1%

Pitch: 2%
maxf0    2%

Table 18 shows the feature usages for the 25 features that were queried when predicting the emotion labels using the "text+allspeech" feature set. Among the 18 lexical items queried, we see that there are numerous overlaps with the lexical items appearing in Table 17, particularly with the grounding and hedging phrases and filled pauses; however, the frequencies differ significantly across the two decision trees. There were seven acoustic–prosodic features queried. The temporal features were most frequently queried, with duration accounting for the bulk of the queries. This was also found in the feature usage for the "allspeech" feature set (Table 9); however, there tempo and preceding pause were also highly queried, while here they are queried far less frequently. In both experiments, energy features are queried about equally, although the individual features vary in use and frequency. The most striking change comes from the use of pitch features; here one pitch feature is used, and only rarely, while in "allspeech" all four pitch features are queried much more frequently.

Inspection of this decision tree shows that along each of the 30 paths leading to an emotion label, at least two queries were made of acoustic–prosodic features, and in every case at least one of these queries was made of the duration feature. In general, we continue to see that student turns with longer durations are predicted negative; the majority of queries along each path were made of the lexical items, however. As we saw above, it is difficult to generalize over the paths about how these specific lexical items are used to predict the emotion classes. For example, student turns with short durations that contain "right" are predicted positive, but if they don't contain "right" or "um" or "huh" or "uh" and have a slow tempo and a short duration, then they are predicted neutral.


6.2. Predicting human–computer emotion labels with lexical information

In our human–computer corpus, we created four lexical feature sets to represent the transcription of each student turn. The first two lexical feature sets, "humtext" and "humtext+id" (without and with identifier features, respectively), represent the human transcription of each student turn, i.e. the "ideal" performance of ITSPOKE with respect to speech recognition, and correspond to the "text" feature sets in our human–human corpus. The second two lexical feature sets, "asrtext" and "asrtext+id" (without and with identifier features, respectively), represent ITSPOKE's actual best speech recognition hypothesis of what is said in each student turn.

We performed the same series of machine learning experiments with these four lexical feature sets that we did in our human–human corpus (Section 6.1), i.e. each set alone and in combination with our 26 acoustic–prosodic feature sets, to examine whether or not the addition of lexical information improves the performance of acoustic–prosodic features, with or without identifier features, on the emotion-labeled data in our human–computer spoken tutoring corpus.

Table 19 presents some of the results of these experiments when run on the 202 emotion-labeled student turns in our human–computer spoken tutoring corpus that both annotators originally agreed on. For comparison, the "allspeech (+/-id)" feature sets from the acoustic–prosodic feature experiments (Section 5.2.1) are repeated from Tables 10 and 11; in those experiments all 12 individual acoustic–prosodic feature sets performed statistically the same as or worse than these "allspeech" feature sets.


Table 19
%Correct (10X10cv); human–computer agreed; MAJ (neutral) = 47%

Feature set             %Correct    95% CI    %Rel.Imp.
allspeech               55          +/- 3     17
allspeech+id            62          +/- 3     29
humtext                 53          +/- 2     11
humtext+id              68          +/- 2     40
asrtext                 58          +/- 2     21
asrtext+id              66          +/- 2     36
humtext+allspeech       62          +/- 2     29
humtext+allspeech+id    64          +/- 2     32
asrtext+allspeech       61          +/- 3     27
asrtext+allspeech+id    62          +/- 3     29


As we also saw in our human–human experiments, all of these new "text"-inclusive feature sets perform significantly better than the majority class baseline (47%), with the highest relative improvements yet seen, ranging from 11% to 40%. Although not shown, all of the 24 individual acoustic–prosodic feature sets significantly outperformed the baseline when combined with either the "asrtext" or "humtext" feature set (with or without identifier features), with accuracies ranging from 52% to 67%. Many of these 48 feature sets performed statistically the same as the "humtext+allspeech (+/-id)" and "asrtext+allspeech (+/-id)" feature sets.

In contrast to the human–human data, neither "humtext" nor "asrtext" alone significantly outperforms the "allspeech" feature set; both perform statistically the same as "allspeech". However, we again see that combining the feature sets yields further improvement; both "humtext+allspeech" and "asrtext+allspeech" outperform the "allspeech" feature set, although only "humtext+allspeech" does so significantly. In addition, "humtext+allspeech" outperforms "humtext" and "asrtext+allspeech" outperforms "asrtext", although only "humtext+allspeech" does this significantly. This result generalizes over much previous research on emotion prediction in naturally occurring human–computer dialogues; Narayanan (2002), Lee et al. (2002), Ang et al. (2002), Batliner et al. (2000), Batliner et al. (2003) all found that using a combination of acoustic–prosodic and lexical information increased the predictive ability over and above using only acoustic–prosodic or lexical information.

In further contrast to the human–human experiments, we see that identifier features sharply improve the performance of the "humtext" and "asrtext" feature sets, although this sharp increase disappears when these feature sets combine with the "allspeech" feature sets. As a result, with identifier features added, both of these combined feature sets ("humtext+allspeech+id" and "asrtext+allspeech+id") no longer sharply outperform their component "+id" feature sets: "asrtext+allspeech+id" performs statistically the same as both "asrtext+id" and "allspeech+id", and "humtext+allspeech+id" also performs statistically the same as "humtext+id" and "allspeech+id". Interestingly, in most cases there was still a significant statistical increase with identifier features added to the individual acoustic–prosodic feature sets combined with either text feature set.


Finally, although we hypothesized that the "humtext" feature sets would present an upper bound on the performance of the "asrtext" sets, because the human transcription is more accurate than the speech recognizer, we see that this is not consistently the case. In fact, without identifier features "asrtext" significantly outperforms "humtext". A comparison of the decision trees produced in either case, however, does not reveal why; words chosen as predictors are not very intuitive in either case (e.g., an example path through a learned "asr" decision tree says predict negative if the utterance contains the word second but does not contain the words don't or zero). Understanding this result is an area for future research. Within the "+id" sets, we see that "humtext" and "asrtext" perform statistically the same. The utility of the "humtext" features compared to "asrtext" increases marginally when combined with the "allspeech" features (with and without identifiers), although statistically these combined feature sets all perform the same.

Table 20 presents some of the results of these experiments run on the 333 emotion-labeled student turns in our human–computer spoken tutoring corpus that both annotators achieved consensus on. For comparison, the "allspeech (+/-id)" feature sets from the acoustic–prosodic feature experiments on this consensus data (Section 5.2.2) are repeated from Tables 12 and 13.

Table 20
%Correct (10X10cv); human–computer consensus; MAJ (neutral) = 48%

Feature set             %Correct    95% CI    %Rel.Imp.
allspeech               49          +/- 2     –
allspeech+id            52          +/- 1     7
humtext                 48          +/- 2     –
humtext+id              57          +/- 1     17
asrtext                 51          +/- 2     5
asrtext+id              53          +/- 2     10
humtext+allspeech       53          +/- 2     10
humtext+allspeech+id    54          +/- 3     11
asrtext+allspeech       53          +/- 1     8
asrtext+allspeech+id    54          +/- 1     11

A comparison of Tables 19 and 20 shows that overall, using consensus-labeled data decreased the performance across feature sets, both with and without identifier features. This was consistently found across all experiments in this paper. In fact, we see that "humtext" now performs statistically the same as the baseline, and as with the human–computer agreed data, it performs worse than the "asrtext" feature set, rather than outperforming it as we first hypothesized. Moreover, again we see that neither "humtext" nor "asrtext" alone significantly outperform the "allspeech" feature set; both feature sets perform statistically the same as "allspeech". However, we again see that combining the feature sets yields further improvement; both "humtext+allspeech" and "asrtext+allspeech" outperform their component feature sets, although not always significantly.

In contrast to the human–computer agreed data, we see that identifier features moderately improve the performance of most of the "text"-inclusive feature sets; only "humtext+id" displays the sharp improvement we saw in the agreed data. As a result, this feature set is the best-performing set, and the combined feature sets ("humtext+allspeech+id" and "asrtext+allspeech+id") do not outperform their component feature sets. However, as we saw in the human–computer agreed data, in most cases there was still a significant statistical increase with identifier features added to the individual acoustic–prosodic feature sets combined with either text feature set. Although not shown, less than half of the individual acoustic–prosodic feature sets significantly outperformed the baseline when combined with the "text" feature set without identifier features, although all of them significantly outperformed the baseline when combined with the "text" feature set with identifier features. In particular, note that although, as discussed in Section 5.2.2, "duration" outperforms "allspeech", when combined with ("hum" or "asr") text features this is no longer the case; "allspeech" performed statistically the same or better than all of the individual acoustic–prosodic + text feature sets in the consensus data.

In sum, these experiments on the emotion-anno-tated data in both our human–human and human–computer corpus clearly demonstrate the predictiveutility of using even a simple bag-of-words represen-tation of what the student said, either in combina-tion with, or as an alternative to, the use ofacoustic–prosodic features. Furthermore, unlikemany of the acoustic–prosodic features, text fea-tures (at least the noisy ‘‘asr’’ versions) have theadvantage of already being available in the systemlogs of ITSPOKE (and in fact of any standard dia-logue system). However, a caveat here is that giventhe domain-dependent nature of many of our lexicalsignals, emotion predictors involving lexical features


Table 21
Feature usage for "humtext", human–computer agreed

are 17%        down 10%       up 4%
is 16%         constant 9%    increasing 3%
yes 14%        no 7%          vertically 3%
increase 11%   zero 6%


As discussed in Section 8, a major focus of our future work will be to construct a much more sophisticated set of text-based features, based on analysis of our dialogue transcriptions. We have already shown for our human–human corpus that this further improves predictive ability (Forbes-Riley and Litman, 2004). With respect to the relative utility of lexical versus acoustic–prosodic features, we hypothesized that combining speech and lexical features would result in better performance than either feature set alone. This hypothesis was supported by prior research (e.g., Narayanan, 2002; Lee et al., 2002; Ang et al., 2002). In our human–human corpus, we found this to be true for our consensus data, but for our agreed data it was only true in the presence of identifier features. In neither case was the improvement statistically significant, however. In our human–computer corpus, this was true only in the absence of identifier features for both our agreed and consensus data, but was only statistically significant in our agreed data. Moreover, in our human–human corpus, using only lexical features (with or without identifiers) always produced better performance than using only speech features (with or without identifiers), although not always significantly. With identifier features added in our human–computer corpus, using only lexical features likewise always produced better performance than using only speech features, although not always significantly. These results are consistent with others' findings (e.g., Narayanan, 2002; Lee et al., 2002; Shafran et al., 2003). One might speculate that the relative utility of the lexical features might just be reflective of the annotation process; since the annotators understood English, it was not really possible for them to listen to "how" something was said without also knowing "what" was said. However, when the original annotators revisited their consensus labels and identified the source of their decisions for a separate study, they overwhelmingly felt that spoken rather than lexical information led to their labels. For example, in the human–computer data, the annotators felt that only 20 emotion labels were assigned fully or in part on the basis of lexical information; the remainder were assigned based on acoustic–prosodic information. The unintuitive lexical items in our feature usage tables also seem to support the annotators' intuitions regarding the source of their labeling decisions. Further hypotheses regarding the predictive utility of our lexical features (in particular, of physics-specific words such as "increase") will be discussed below.

6.2.1. Feature usage in human–computer corpus predictions using text

To give us an intuition about how lexical features, alone and in combination with acoustic–prosodic features, are used to predict emotion classes in our human–computer tutoring dialogues, we ran the decision tree algorithm without boosting, for the "humtext" and "humtext+allspeech" feature sets, for our human–computer emotion-labeled agreed data. These feature sets represent "ideal" speech recognition, and we can also compare these results with our human–human results.
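As a concrete illustration of this kind of experiment, the sketch below trains an unboosted decision tree over binary presence/absence lexical features and tallies which words the learned tree queries. It is not the pipeline used in the paper: it relies on scikit-learn rather than the toolkit cited in the references, the `turns` and `labels` variables are hypothetical stand-ins for the emotion-labeled student turns, and the simple query count is only a rough analogue of the feature-usage percentages reported in Table 21.

```python
# A minimal sketch (not the authors' pipeline): an unboosted decision tree
# over binary bag-of-words features, plus a tally of which lexical items the
# learned tree actually queries. `turns` and `labels` are hypothetical inputs.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

turns = ["the velocity will increase", "um I don't know", "yes it is constant"]
labels = ["neutral", "negative", "positive"]   # one emotion label per student turn

# Presence/absence of each lexical item in the turn (binary bag of words).
vectorizer = CountVectorizer(binary=True, token_pattern=r"[^\s]+")
X = vectorizer.fit_transform(turns)

tree = DecisionTreeClassifier(random_state=0)  # no boosting
tree.fit(X, labels)

# Count how often each lexical item is queried at an internal node of the tree,
# a rough analogue of the feature-usage percentages reported in Table 21.
vocab = vectorizer.get_feature_names_out()
usage = Counter(vocab[i] for i in tree.tree_.feature if i >= 0)
total = sum(usage.values())
for word, count in usage.most_common():
    print(f"{word}: {100.0 * count / total:.0f}% of queries")
```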

Table 21 shows the feature usages for the lexical items whose presence or absence was queried when predicting the emotion labels using the "humtext" feature set; the student utterances contained a total of 127 lexical items, but only 11 of them were queried in this decision tree. As the table exemplifies, in contrast to the human–human condition, we see no grounding or hedging phrases or filled pauses in the human–computer condition. In (Bhatt et al., 2004), it is also found that the use of these types of phrases sharply decreases in human–computer tutoring, as compared to human–human tutoring. Most of the lexical items in Table 21 are nouns and modifiers specific to tutoring in the physics domain (e.g., "constant"). Similar results are found in (Lee et al., 2002), where "emotionally salient" words are computed from a corpus of emotion-labeled naturally occurring human–computer call center dialogues involving travel, and then used for emotion prediction experiments; for example, their partial list of emotionally salient words includes "baggage" and "delayed". These results thus further illustrate the need for domain-dependent methods of emotion prediction in naturally occurring data; most of these words will not be contained in generic domain-independent databases of emotional words (e.g., Siegle, 1994). In our domain, we hypothesize that the physics-dependent words in Table 21 might correspond to concepts that students find particularly difficult to understand.


We are currently annotating the student turns in our corpus with respect to "correctness", and plan to investigate whether turns containing physics terms predictive of emotion are more likely to be "incorrect" or "partially correct" compared to other student turns.

Inspection of this decision tree indicates that it is difficult to generalize over the paths about how specific lexical items are used to predict the emotion classes. For example, student turns containing "are" are predicted to be neutral, while student turns that don't contain "are" or "is" but that do contain "yes" are predicted to be negative if they also contain "vertically", and are predicted to be positive if they don't contain "vertically".
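For concreteness, the fragment below transcribes just the quoted branch of that tree into code; it covers only the paths described in this paragraph, and the fallback prediction for turns not covered by the quote is an assumption.

```python
# Illustrative rendering of the quoted "humtext" decision-tree branch only;
# the real tree queries 11 lexical items and is not reproduced here.
def predict_from_quoted_branch(words: set) -> str:
    if "are" in words:
        return "neutral"
    if "is" not in words and "yes" in words:
        return "negative" if "vertically" in words else "positive"
    return "neutral"  # assumption: fall back to the majority class elsewhere

print(predict_from_quoted_branch({"yes", "it", "is"}))    # path not covered by the quote
print(predict_from_quoted_branch({"yes", "vertically"}))  # -> negative
```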

Table 22 shows the feature usages for the 23 features that were queried when predicting emotion using the "humtext+allspeech" feature set. Among the 15 lexical items queried, we see that there are numerous overlaps with the lexical items appearing in Table 21, although the frequencies differ significantly. Again, most of these lexical items are specific to tutoring in the physics domain. There were also 8 acoustic–prosodic features queried. Among them, the temporal features were most frequently queried, with internal silence and duration accounting for the bulk of the queries. This was also found in the feature usage for the "allspeech" feature set (Table 14); however, there tempo and preceding pause were also highly queried, while here they are queried far less frequently. In both experiments, energy features are queried about equally, although the individual features vary in use and frequency. The most striking change comes from pitch features; here pitch features are rarely used, while in "allspeech" the mean pitch is queried much more frequently.


Table 22
Feature usage for "humtext+allspeech", human–computer agreed

Lexical Items: 65%
don't 11%    decrease 4%     the 3%
are 10%      increase 3%     it 2%
is 9%        same 3%         gravity 2%
up 6%        velocity 3%     no 1%
down 4%      increasing 3%   yes 1%

Temporal: 30%
intsilence 12%   tempo 5%
duration 12%     prepause 1%

Energy: 3%
maxrms 2%   minrms 1%

Pitch: 2%
meanf0 1%   minf0 1%


When compared with the corresponding human–human feature usage experiments, however, we see that the high usage of the duration feature holds across the board; we also see the same relative decrease in the use of the pitch, tempo and preceding pause features after adding the "text"/"humtext" feature sets to the "allspeech" feature sets. Note that except for the duration feature, the feature usages here again vary from those in (Ang et al., 2002), further supporting the need for combining generalizations with domain-specific emotion prediction in naturally occurring data.

Inspection of this decision tree shows that along 30 of the 31 paths leading to an emotion label, at least one query, and on average 3 queries, were made of acoustic–prosodic features. In every case, one of these queries was made to the internal silence feature, and in every case where there was more than one, a query was made to the duration feature. In general, we continue to see that student turns with longer durations are predicted negative, especially if they have low amounts of internal silence. The majority of queries along each path were made of the lexical items, however. As we saw above, it is difficult to generalize over the paths about how these specific lexical items are used to predict the emotion classes. For example, student turns that contain "don't" are predicted negative. Student turns that contain "are" but not "don't" are predicted positive if they have a large amount of internal silence, and are predicted neutral otherwise.

7. Related work

Previous research in the area of emotional speech has suggested that acoustic and prosodic features of the speech signal can be used to develop predictive models of user emotional states (cf. Shafran et al., 2003; Batliner et al., 2003; ten Bosch, 2003; Narayanan, 2002; Liscombe et al., 2003; Pantic and Rothkrantz, 2003; Scherer, 2003; Oudeyer, 2002; Ang et al., 2002; Lee et al., 2002; Cowie et al., 2001; Litman et al., 2001; Lee et al., 2001; Batliner et al., 2000; Polzin and Waibel, 1998). While our work builds on this prior work, our focus on emotion recognition as a first step towards building an adaptive dialogue tutoring system yields differences in type of training data, granularity and relativity of the emotion annotation scheme, and scope of feature set.

First, previous studies have often used speech read by actors as training data (often with semantically neutral content) (Oudeyer, 2002; Polzin and Waibel, 1998; Liscombe et al., 2003), even though such prototypical emotional speech does not necessarily reflect naturally occurring speech with meaningful content (Fischer, 1999), e.g., as found in tutoring dialogues.


While other dialogue researchers have also started to use corpora of naturally occurring speech for training (Batliner et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Litman et al., 2001; Batliner et al., 2000; Lee et al., 2001), little work to date has addressed emotion recognition in educational settings such as our tutoring domain. As our results in this paper have shown, while some generalizations about the features useful to emotion prediction can be made across naturally occurring corpora, to some extent the specific usefulness of emotion predictors is domain-specific (e.g., lexical information). In addition, we examined and contrasted our findings across two types of spoken dialogue corpora: one with a computer tutor, and the other with a human tutor. Given the current limitations of human language technologies, computer tutors are far less flexible than human tutors, and also make more errors. The fact that our results often differed across our human and computer corpora suggests the importance of training current systems from appropriate data (human–computer data in our case). The use of human tutors, however, provides a benchmark for estimating the performance of an "ideal" computer system with respect to language technologies, and thus for understanding how our results might change as speech and language technologies continue to improve.

Furthermore, the emotion categories studied in naturally occurring dialogues have typically been simpler than those used in studies based on actor-read speech, due to the need to first manually label such emotion classes reliably across annotators in the naturally occurring corpora. Lee et al. (2001) and Batliner et al. (2000) annotate two emotion classes (negative/non-negative and emotional/non-emotional, respectively), and Ang et al. (2002) annotate 6 emotion classes but only use a negative/other distinction in their machine learning experiments,32 while we employ a three-way distinction between negative, neutral, and positive emotional classes for both our annotation scheme and our machine learning experiments.33 In contrast, studies of speech generated by actors pretending to be in one of a set of predefined emotion classes typically employ more complex categorizations, e.g., five emotion classes (happy, afraid, angry, sad, neutral) as in (Polzin and Waibel, 1998), and even 10 emotion classes for actor-read dates and numbers as in (Liscombe et al., 2003). In further contrast to some studies of naturally occurring dialogue (e.g., Lee et al., 2001), but similarly to others (e.g., Ang et al., 2002), our annotations are context-relative (e.g., relative to the other turns in the dialogue) and task-relative (e.g., relative to tutoring), because, as prior work suggests, the range and strength of emotional expression differs across domains (and this may account in part for why different features are more useful as emotion predictors in different domains) (Cowie and Cornelius, 2003). Another comparison is with the work of Batliner et al. (2003) (also see Fischer (1999)), where annotators label occurrences of a set of formal prosodic, lexical and dialogue properties that can be shown to be dependent on the speaker's changing attitude (Batliner et al., 2003, p. 125). In contrast, we allow annotators to be guided by their intuition rather than a set of formal features, to avoid restricting or otherwise influencing the annotator's intuitive understanding of emotion expression, and because such features may not be used consistently or unambiguously across speakers. Instead of associating specific features with specific emotions, our manual contains annotated audio-enhanced corpus examples, as in Figs. 2 and 3. The features mentioned as providing evidence for specific emotions in those examples (e.g., that a lexical expression of certainty was one contributor to the positive label in Fig. 2) were elicited during post-annotation discussion, for expository use in this paper.

32 Ang et al. (2002) also discuss the use of an "uncertainty" label, although it did not improve inter-annotator agreement. In (Litman and Forbes-Riley, 2004a) we discuss the use of "weak" labels, which are more similar to an "intensity" dimension found in studies of elicited speech (cf. Cowie et al., 2001).

33 We have also explored conflating our positive and neutral classes and our positive and negative classes, to yield a negative/non-negative and an emotional/non-emotional distinction (Litman and Forbes-Riley, 2004b).

Finally, like other researchers who study emotion detection in naturally occurring speech with the ultimate goal of building adaptive systems (e.g., Lee et al., 2001), we only investigate the predictive utility of features that can be computed fully automatically from the student turn and that will be available to our dialogue system in real-time.


Examining the utility of acoustic–prosodic or lexical features that have previously been shown to be useful but that would require manual labeling (e.g., ToBI labelings (Liscombe et al., 2003)) is currently outside the scope of our research. However, as discussed further in Section 8, we are expanding our initial set of acoustic–prosodic and lexical features to include additional automatically available turn, word, and dialogue level features used in the literature or available in ITSPOKE logs, and we will also examine better methods of feature selection.

In the larger arena of user modeling there has also been increasing interest in modeling and adapting to user emotion and affect (de Rosis, 2001b, 2002; Conati et al., 2003b; de Rosis, 2001a, 1999). While some of this research overlaps with the area of emotional speech (e.g., Mozziconacci, 2001), other language-based work has been solely text-based (Cavalluzzi et al., 2003). In addition, much user-modeling research has focused on non-linguistic sources of information for emotion recognition such as physiological signals (Picard et al., 2001). Recent studies from the domain of educational applications, for example, have used data from biometric sensors (Biddle et al., 2003; Conati et al., 2003a), videos of facial expressions (Fan et al., 2003) and eye-gaze (Kort et al., 2001). As human–computer interactions become more multimodal, it is expected that emotions will increasingly be detected using multiple features from different sensors (Pantic and Rothkrantz, 2003). As discussed above, our current work is starting to explore the integration of two sources of linguistic input (speech and text), and is thus one small step towards this goal. We hope that other groups will start integrating non-linguistic signals (e.g., facial expressions, body movements, and physiological reactions) with language-based features, in order to further improve the ability to model and adapt to user emotions in tutoring as well as other types of spoken dialogue systems.

Finally, although not motivated by emotion modeling, with recent advances in speech technology several other tutorial dialogue research projects have begun to incorporate basic spoken language capabilities into their dialogue systems (Mostow and Aist, 2001; Schultz et al., 2003; Graesser et al., 2001a; Rickel and Johnson, 2000). Speech is the most natural and easy-to-use form of natural language interaction, and studies have already shown potential benefits of speech in the context of dialogue tutoring.


For example, in human–human tutoring, spontaneous self-explanation improves learning gains (Chi et al., 1994), and spontaneous self-explanation occurs much more frequently in spoken tutoring than in text-based tutoring (Hausmann and Chi, 2002). Moreno et al. (2001) have found that in human–computer tutoring, the use of an interactive pedagogical agent that communicates using speech rather than text output improves student learning, while the visual presence or absence of the agent does not. Our work in ITSPOKE is also concerned with the investigation of such issues. Preliminary results suggest that for our human tutoring task, the use of speech rather than text yields slightly improved student learning, with significantly less time on task (Litman et al., 2004).

8. Conclusions and current directions

We have shown how information from student utterances can be used to automatically recognize student emotions and attitudes. In particular, we presented an empirical study addressing the use of machine learning techniques for automatically recognizing student emotional states in two corpora of spoken tutoring dialogues, one with a human tutor, and one with a computer tutor. Our methodology extends the results of prior research on emotion recognition in spoken dialogue systems by applying them to these two versions (human–human and human–computer) of our new tutoring domain. We first annotated student turns in our two corpora for emotional states. Our annotation scheme distinguishes negative, neutral and positive emotions, and our inter-annotator agreement is comparable to prior emotion annotation results in other types of naturally occurring corpora. Using both originally agreed and consensus-labeled versions of our annotated student turns, we then automatically extracted 12 acoustic and prosodic features that are available in real-time to ITSPOKE, our intelligent tutoring spoken dialogue system. We examined the use of each of these features alone, in combination with 3 student and task dependent "identifier" features for student, gender, and problem, and in combination with lexical items from the manually transcribed or recognized student speech. While our highest prediction accuracies were typically obtained by combining many features, we also saw that simpler models often worked nearly as well in many experiments, and that such models might have other advantages that could be traded off against accuracy (e.g., ease of implementation, domain-independence). Furthermore, the relative predictive utility of specific feature sets often differed across our human–human and human–computer corpora, highlighting the importance of training computational systems from appropriate data.
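To give a rough picture of what such turn-level feature extraction involves, the sketch below assembles a feature dictionary of the general kind described here: duration, an internal-silence estimate, RMS energy statistics, and the three identifier features. It is a simplified stand-in rather than ITSPOKE's actual extraction code; the silence threshold and frame size are arbitrary assumptions, only a subset of the 12 acoustic–prosodic features is covered, and the pitch (f0) statistics are omitted because they would require a separate pitch tracker.

```python
# Simplified sketch of turn-level feature extraction (not ITSPOKE's code).
# Pitch (f0) statistics are omitted; the silence threshold is an assumption.
import numpy as np

def turn_features(samples: np.ndarray, sample_rate: int,
                  student_id: str, gender: str, problem: str,
                  frame_ms: float = 10.0, silence_rms: float = 0.01) -> dict:
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = max(1, len(samples) // frame_len)
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # per-frame RMS energy

    return {
        # temporal features
        "duration": len(samples) / sample_rate,
        "intsilence": float(np.mean(rms < silence_rms)),  # fraction of silent frames
        # energy features
        "maxrms": float(rms.max()),
        "minrms": float(rms.min()),
        "meanrms": float(rms.mean()),
        # identifier features (student, gender, problem)
        "student": student_id,
        "gender": gender,
        "problem": problem,
    }

# Hypothetical usage on one second of audio at 16 kHz.
audio = np.random.default_rng(0).normal(scale=0.02, size=16000)
print(turn_features(audio, 16000, student_id="s01", gender="f", problem="p3"))
```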

Across all experiments, we found that using consensus-labeled data consistently decreased predictive ability, as compared to using originally agreed emotion labels only. We also found across all experiments that turn duration was always among the most useful acoustic–prosodic predictors of emotion, and that longer durations were generally predictive of the negative emotion class. Although in general there was a trend for temporal features to provide the most predictive utility, followed by energy features and lastly pitch features, the usefulness of particular acoustic–prosodic features varied across experiments; indeed, across prior research more generally, the usefulness of particular acoustic–prosodic features is often domain-dependent. Considering the impact of identifier features, we found that adding them to our acoustic–prosodic feature sets did not significantly improve performance in our human–human corpus, but in our human–computer corpus, acoustic–prosodic features in combination with identifier features were significantly more useful predictors than acoustic–prosodic features alone. With respect to using lexical information to predict emotion, our experiments in both our human–human and human–computer corpora clearly demonstrate the predictive utility of using even a simple bag-of-words representation of what the student said, either in combination with or as an alternative to using acoustic–prosodic features. In both corpora, without identifier features, we found that predictive ability was maintained or increased when acoustic–prosodic and lexical information was combined. However, when we added identifier features to these combined feature sets, only in our human–human corpus did we continue to see that the combined feature sets outperformed their component sets. Surprisingly, however, we did not see a consistent decrease in predictive utility when using recognized speech instead of human-transcribed speech in our human–computer corpus. Finally, we found across all experiments that our best results in our human–computer corpus are worse than in our human–human corpus, with respect to both raw accuracy and relative improvements in error reduction over majority class baselines. This, along with the lower baseline in the human–computer corpus, suggests that recognizing emotion in human–computer tutoring dialogues is the more difficult task. We have hypothesized that this is due in part to the limitations of the computer tutor, which among other effects causes students to speak less and thus provide less information about their emotional states. Another factor which may influence student emotional state and deserves future study is the tutor's emotional state. For example, a more "caring" tutor may elicit different emotional responses than a more "stoic" one. That our computer tutor is "emotionless" may thus provoke less emotion on the part of the student.34

34 We thank our reviewers for pointing out the impact of the tutor's emotional state on student emotional states.

Our results provide a first step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states. While the results of all of our experiments show significant improvements in predictive accuracy compared to majority class baselines, there is still room for further improvement. A major focus of our future work in increasing predictive accuracy will be to expand our feature sets. We are expanding our initial set of acoustic–prosodic features to include additional turn level features used in the literature (e.g., pitch slopes at various locations in the turn (Ang et al., 2002), as well as word-level analysis such as found in (Batliner et al., 2003)). We will also construct a more sophisticated set of text-based features (e.g., constructed using language models, word lattices, and emotionally salient words), and we will construct a set of dialogue-based features, including information available in ITSPOKE logs. Researchers (Lee et al., 2002; Ang et al., 2002; Litman et al., 2001; Batliner et al., 2000) have shown that such features can contribute to emotion recognition in other domains, and we have already shown for our human–human corpus that such features further improve predictive ability (Forbes-Riley and Litman, 2004), although not all of the same features are automatically available in ITSPOKE.

Another major focus of our future work is to investigate better methods of feature selection such as those used in other research (e.g., Ang et al., 2002; Lee et al., 2002; Narayanan, 2002).


The decreases in predictive ability found across some of our experiments when adding features may indicate that there is some redundancy within the combined feature sets; moreover, the high dimensionality of our lexical item "dictionaries" leads to a data-sparseness problem for learning the presence or absence of lexical items. Although the machine learning algorithm we employed automatically applies feature selection, it appears that a more robust method would be preferable. Another explanation for the decreases in predictive ability found when adding features is that the annotated data sets are too small and not representative of the larger corpora; examination of the learning curves for some of our experiments suggests that predictive accuracy is still rising for our current amount of training data. We are currently annotating more data to test this; in addition, we are exploring partial automation of emotion annotation via semi-supervised machine learning (Maeireizo et al., 2004). With more data we can also try methods such as down-sampling to balance the data in each emotion class (Ang et al., 2002).35

35 Since our corpora are small, an alternative to emotion prediction would be to use a rule-based approach instead of machine learning. However, our choice to use machine learning was motivated by prior work (Litman, 1996) where we showed that machine learning produced better results than a rule-based approach in a small corpus. Others (e.g., DiEugenio et al., 1997) have also shown good results using machine learning on small corpora.
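As an illustration of the down-sampling option mentioned above, the sketch below randomly discards turns from the larger classes until the three emotion classes are equally represented. The data layout is hypothetical, and the paper does not commit to this exact procedure.

```python
# Minimal sketch of down-sampling to balance emotion classes (hypothetical data).
import random
from collections import defaultdict

def downsample(turns, labels, seed=0):
    """Randomly drop examples so every emotion class has equally many turns."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for turn, label in zip(turns, labels):
        by_class[label].append(turn)
    n = min(len(examples) for examples in by_class.values())
    balanced = []
    for label, examples in by_class.items():
        for turn in rng.sample(examples, n):
            balanced.append((turn, label))
    rng.shuffle(balanced)
    return balanced

turns = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = ["neutral", "neutral", "neutral", "negative", "positive", "positive"]
print(downsample(turns, labels))  # one neutral, one negative, one positive turn
```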

Finally, we will explore how the recognized emotions can be used to improve system performance. Our best results in our human–computer corpus indicate that although there is still room for improvement with respect to predicting student emotions in our intelligent tutoring domain, improving the system so that it can successfully adapt to emotion is a feasible application. In particular, examination of the distribution of error in our best human–computer result excluding those relying on a human transcription (the asrtext+id feature set, with 66% accuracy) shows that most of the error involves incorrectly treating negative and positive turns as neutral. In fact, this is what the non-adaptive version of our system currently does. Only approximately 20% of the error involves wrongly labeling a student turn as a positive or negative emotion.
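The error breakdown described here can be read off a three-by-three confusion matrix. The sketch below does that arithmetic on a hypothetical matrix (the paper does not report its actual confusion counts), with the counts chosen only to roughly mirror the pattern just described.

```python
# Sketch of the error breakdown discussed above, on a hypothetical 3x3
# confusion matrix (rows = true label, columns = predicted label).
LABELS = ["negative", "neutral", "positive"]
confusion = {
    #                predicted: negative  neutral  positive
    "negative": {"negative": 18, "neutral": 42,  "positive": 3},
    "neutral":  {"negative": 6,  "neutral": 145, "positive": 5},
    "positive": {"negative": 2,  "neutral": 36,  "positive": 25},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[lab][lab] for lab in LABELS)
errors = total - correct

# Errors where a negative or positive turn was (incorrectly) treated as neutral.
emotional_as_neutral = confusion["negative"]["neutral"] + confusion["positive"]["neutral"]
# Errors where a turn was wrongly labeled as a positive or negative emotion.
wrongly_emotional = errors - emotional_as_neutral

print(f"accuracy: {100.0 * correct / total:.0f}%")
print(f"errors treating emotional turns as neutral: {100.0 * emotional_as_neutral / errors:.0f}%")
print(f"errors wrongly labeling a turn as emotional: {100.0 * wrongly_emotional / errors:.0f}%")
```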

It is, however, important to note that our recognized emotions, though labeled according to their valence as, e.g., "negative" or "positive", as in other annotation schemes for naturally occurring dialogue, are not necessarily in a direct relationship with system adaptation. In sharp contrast to, e.g., call center systems, where "negative" user emotions are clearly detrimental to the overall performance of the system, in our domain it is still an open question how emotions and learning interrelate (cf. Craig and Graesser, 2003). Our next step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states is thus to develop techniques for adapting to student emotions in a way that improves their overall learning. To develop such adaptive techniques, we have annotated human tutor responses to the labeled emotions in our human–human tutoring corpus (Forbes-Riley et al., 2005). We can combine analyses of these data with the rich body of work that concerns the relationship between tutoring methods and student learning (e.g., Hausmann and Chi, 2002).

Acknowledgements

This research is supported by NSF Grants 9720359 and 0328431. Thanks to the Why2-Atlas team and S. Silliman for system design and data collection.

References

Aist, G., Kort, B., Reilly, R., Mostow, J., Picard, R., 2002. Adding human-provided emotional scaffolding to an automated reading tutor that listens increases student persistence. In: Proc. Intelligent Tutoring Systems (ITS), p. 992.

Aleven, V., Rose, C.P. (Eds.), July 2003. Proc. AIED 2003 Workshop on Tutorial Dialogue Systems: With a View toward the Classroom, Sydney, Australia.

Aleven, V., Popescu, O., Koedinger, K., 2001. Towards tutorial dialog to support self-explanation: adding natural language understanding to a cognitive tutor. In: Moore, J.D., Redfield, C.L., Johnson, W.L. (Eds.), Proc. Artificial Intelligence in Education, pp. 246–255.

Ang, J., Dhillon, R., Krupski, A., Shriberg, E., Stolcke, A., 2002. Prosody-based automatic detection of annoyance and frustration in human–computer dialog. In: Proc. International Conf. on Spoken Language Processing (ICSLP), pp. 203–207.

Batliner, A., Fischer, K., Huber, R., Spilker, J., Nöth, E., 2000. Desperately seeking emotions: Actors, wizards, and human beings. In: ISCA Workshop on Speech and Emotion, pp. 195–200.

Batliner, A., Fischer, K., Huber, R., Spilker, J., Nöth, E., 2003. How to find trouble in communication. Speech Commun. 40, 117–143.


Bhatt, K., Evens, M., Argamon, S., 2004. Hedged responses and expressions of affect in human/human and human/computer tutorial interactions. In: Proc. Cognitive Science.

Biddle, E.S., Malone, L., McBride, D., July 2003. Objective measurement of student affect to optimize automated instruction. In: Conati, C., Hudlicka, E., Lisetti, C. (Eds.), Proc. 3rd User Modeling Workshop on Assessing and Adapting to User Attitudes and Effect: Why, When, and How? Johnstown, PA, pp. 65–68.

Black, A., Taylor, P., 1997. Festival speech synthesis system: system documentation (1.1.1). The Centre for Speech Technology Research, University of Edinburgh. Available from: <http://www.cstr.ed.ac.uk/projects/festival/>.

Carletta, J., 1996. Assessing agreement on classification tasks: the kappa statistic. Comput. Linguistics 22 (2).

Cavalluzzi, A., Carolis, B.D., Carofiglio, V., Grassano, G., 2003. Emotional dialogs with an embodied agent. In: Proc. User Modeling Conference, Johnstown, PA, pp. 86–95.

Chi, M., Leeuw, N.D., Chiu, M.-H., Lavancher, C., 1994. Eliciting self-explanations improves understanding. Cognitive Sci. 18, 439–477.

Cohen, J., 1960. A coefficient of agreement for nominal scales. Educat. Psychol. Measurement 20, 37–46.

Cohen, J., 1968. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220.

Coles, G., 1999. Literacy, emotions, and the brain. Reading Online, March 1999, <http://www.readingonline.org/critical/coles.html>.

Conati, C., Chabbal, R., Maclaren, H., July 2003a. A study on using biometric sensors for monitoring user emotions in educational games. In: Conati, C., Hudlicka, E., Lisetti, C. (Eds.), Proc. 3rd User Modeling Workshop on Assessing and Adapting to User Attitudes and Effect: Why, When, and How? Johnstown, PA, pp. 16–22.

Conati, C., Hudlicka, E., Lisetti, C. (Eds.), July 2003b. Proc. 3rd User Modeling Workshop on Assessing and Adapting to User Attitudes and Effect: Why, When, and How? Johnstown, PA.

Cowie, R., Cornelius, R.R., 2003. Describing the emotional states that are expressed in speech. Speech Commun. 40, 5–32.

Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J., 2001. Emotion recognition in human–computer interaction. IEEE Signal Process. Mag. 18 (January), 32–80.

Craig, S.D., Graesser, A., 2003. Why am I confused: an exploratory look into the role of affect in learning. In: Mendez-Vilas, A., Gonzalez, J. (Eds.), Advances in Technology-based Education: Towards a Knowledge-based Society, Vol. 3, pp. 1903–1906.

de Rosis, F. (Ed.), June 1999. Proc. User Modeling Workshop on Attitude, Personality and Emotions in User-Adapted Interaction, Alberta, Canada.

de Rosis, F. (Ed.), July 2001a. Proc. User Modeling Workshop on Attitude, Personality and Emotions in User-Adapted Interaction, Sonthofen, Germany.

de Rosis, F. (Ed.), 2001b. Special Issue on User Modeling and Adaptation in Affective Computing, Vol. 11(4).

de Rosis, F. (Ed.), 2002. Special Issue on User Modeling and Adaptation in Affective Computing, Vol. 12(1).

Devillers, L., Lamel, L., Vasilescu, I., 2003. Emotion detection in task-oriented spoken dialogs. In: Proc. IEEE Internat. Conf. on Multimedia & Expo (ICME).


DiEugenio, B., Glass, M., 2004. The kappa statistic: a second look. Comput. Linguistics 30 (1).

DiEugenio, B., Moore, J.D., Paolucci, M., 1997. Learning features that predict cue usage. In: Proc. 35th Annual Meeting of the Association for Computational Linguistics (ACL97), Madrid, Spain.

Evens, M., Brandle, S., Chang, R.-C., Freedman, R., Glass, M., Lee, Y.H., Shim, L.S., Woo, C.W., Zhang, Y., Zhou, Y., Michael, J.A., Rovick, A.A., 2001. Circsim-tutor: an intelligent tutoring system using natural language dialogue. In: Proc. Twelfth Midwest AI and Cognitive Science Conference, MAICS 2001, Oxford, OH, pp. 16–23.

Fan, C., Johnson, M., Messom, C., Sarrafzadeh, A., 2003. Machine vision for an intelligent tutor. In: Proc. 2nd International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS), Singapore.

Fischer, K., December 1999. Annotating emotional language data. Verbmobil Report 236.

Forbes-Riley, K., Litman, D., 2004. Predicting emotion in spoken dialogue from multiple knowledge sources. In: Proc. Human Language Technology Conf.: 4th Meeting of the North American Chap. of the Assoc. for Computational Linguistics (HLT/NAACL), pp. 201–208.

Forbes-Riley, K., Litman, D., Huettner, A., Ward, A., 2005. Dialogue-learning correlations in spoken dialogue tutoring. In: Proc. Internat. Conf. on Artificial Intelligence in Education.

Fox, B.A., 1993. The Human Tutorial Dialogue Project. Lawrence Erlbaum.

Freund, Y., Schapire, R., 1996. Experiments with a new boosting algorithm. In: Proc. Internat. Conf. on Machine Learning, pp. 148–156.

Graesser, A., Person, N., Harter, D., 2001a. Teaching tactics and dialog in AutoTutor. Internat. J. Artificial Intell. Educat. 12 (3), 257–279.

Graesser, A., VanLehn, K., Rose, C., Jordan, P., Harter, D., 2001b. Intelligent tutoring systems with conversational dialogue. AI Mag. 22 (4), 39–51.

Hausmann, R., Chi, M., 2002. Can a computer interface support self-explaining? Internat. J. Cognitive Technol. 7 (1), 4–14.

Huang, X.D., Alleva, F., Hon, H.W., Hwang, M.Y., Lee, K.F., Rosenfeld, R., 1993. The SphinxII speech recognition system: an overview. Computer Speech and Language 2, 137–148.

Izard, C.E., 1984. Emotion-cognition relationships and human development. In: Izard, C., Kagan, J., Zajonc, R. (Eds.), Emotions, Cognition, and Behavior. Cambridge University Press, New York, pp. 17–37.

Jordan, P., VanLehn, K., 2002. Discourse processing for explanatory essays in tutorial applications. In: Proc. 3rd SIGdial Workshop on Discourse and Dialogue, pp. 74–83.

Jordan, P., Makatchev, M., VanLehn, K., 2003. Abductive theorem proving for analyzing student explanations. In: Hoppe, U., Verdejo, F., Kay, J. (Eds.), Proc. Artificial Intelligence in Education. IOS Press, pp. 73–80.

Jordan, P.W., Makatchev, M., VanLehn, K., 2004. Combining competing language understanding approaches in an intelligent tutoring system. In: Proc. Intelligent Tutoring Systems Conference (ITS), pp. 346–357.

Kort, B., Reilly, R., Picard, R.W., 2001. An affective model of interplay between emotions and learning: reengineering educational pedagogy – building a learning companion. In: Internat. Conf. on Advanced Learning Technologies (ICALT), pp. 43–48.

Krippendorff, K., 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, CA.

Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 159–174.

Lee, C., Narayanan, S., Pieraccini, R., 2001. Recognition of negative emotions from the speech signal. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Lee, C., Narayanan, S., Pieraccini, R., 2002. Combining acoustic and language information for emotion recognition. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP).

Liscombe, J., Venditti, J., Hirschberg, J., 2003. Classifying subject ratings of emotional speech using acoustic features. In: Proc. EuroSpeech.

Litman, D., 1996. Cue phrase classification using machine learning. J. Artificial Intell. Res. 5, 53–94.

Litman, D., Forbes, K., 2003. Recognizing emotion from student speech in tutoring dialogues. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 25–30.

Litman, D., Forbes-Riley, K., 2004a. Annotating student emotional states in spoken tutoring dialogues. In: Proc. 5th SIGdial Workshop on Discourse and Dialogue, pp. 144–153.

Litman, D.J., Forbes-Riley, K., 2004b. Predicting student emotions in computer–human tutoring dialogues. In: Proc. Association for Computational Linguistics (ACL), pp. 352–359.

Litman, D., Silliman, S., 2004. ITSPOKE: An intelligent tutoring spoken dialogue system. In: Proc. Human Language Technology Conf.: 4th Meeting of the North American Chap. of the Assoc. for Computational Linguistics (HLT/NAACL) (Companion Volume), pp. 233–236.

Litman, D., Hirschberg, J., Swerts, M., 2001. Predicting user reactions to system error. In: Proc. Association for Computational Linguistics (ACL), pp. 362–369.

Litman, D., Rose, C.P., Forbes-Riley, K., VanLehn, K., Bhembe, D., Silliman, S., 2004. Spoken versus typed human and computer dialogue tutoring. In: Proc. Internat. Conf. on Intelligent Tutoring Systems (ITS), pp. 368–379.

Maeireizo, B., Litman, D., Hwa, R., 2004. Co-training for predicting emotions with spoken dialogue data. In: Companion Proc. Association for Computational Linguistics (ACL), pp. 203–206.

Masters, J.C., Barden, R.C., Ford, M.E., 1979. Affective states, expressive behavior, and learning in children. J. Personality Soc. Psychol. 37, 380–390.

Moreno, R., Mayer, R.E., Spires, H.A., Lester, J.C., 2001. The case for social agency in computer-based teaching: do students learn more deeply when they interact with animated pedagogical agents? Cognit. Instruct. 19 (2), 177–213.

Mostow, J., Aist, G., 2001. Evaluating tutors that listen: an overview of Project LISTEN. In: Forbus, K., Feltovich, P. (Eds.), Smart Machines in Education. MIT/AAAI Press, pp. 169–234.

Mozziconacci, S.J.L., 2001. Modeling emotion and attitude in speech by means of perceptually based parameter values. User Model. User-Adapted Interact. 11 (4), 297–326.

Narayanan, S., 2002. Towards modeling user behavior in human–machine interaction: effect of errors and emotions. In: Proc. ISLE Workshop on Dialogue Tagging for Multi-modal Human Computer Interaction.


Nasby, W., Yando, R., 1982. Selective encoding and retrieval of affectively valent information. J. Personality Soc. Psychol. 43, 1244–1255.

Oudeyer, P.-Y., 2002. The production and recognition of emotions in speech: features and algorithms. Internat. J. Human Comput. Studies 59 (1–2), 157–183.

Pantic, M., Rothkrantz, L.J.M., 2003. Toward an affect-sensitive multimodal human–computer interaction. In: Proc. IEEE Conf., Vol. 91(9), pp. 1370–1390.

Picard, R.W., Vyzas, E., Healey, J., 2001. Toward machine emotional intelligence: analysis of affective physiological state. IEEE Trans. Pattern Anal. Mach. Intell. 23 (10).

Polzin, T.S., Waibel, A.H., 1998. Detecting emotions in speech. In: Proc. Cooperative Multimodal Communication.

Potts, R., Morse, M., Felleman, E., Masters, J.C., 1986. Children's emotions and memory for affective narrative content. Motivat. Emot. 10, 39–57.

Rickel, J., Johnson, W.L., 2000. Task-oriented collaboration with embodied agents in virtual worlds. In: Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (Eds.), Embodied Conversational Agents. MIT Press, pp. 95–122.

Rose, C.P., 2000. A framework for robust sentence level interpretation. In: Proc. First Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 1129–1135.

Rose, C.P., Jordan, P., Ringenberg, M., Siler, S., VanLehn, K., Weinstein, A., 2001. Interactive conceptual tutoring in Atlas-Andes. In: Proc. Artificial Intelligence in Education, pp. 256–266.

Rose, C., 2005. Personal communication.

Rose, C.P., Aleven, V. (Eds.), June 2002. Proc. ITS 2002 Workshop on Empirical Methods for Tutorial Dialogue Systems, San Sebastian, Spain.

Rose, C.P., Freedman, R. (Eds.), 2000. AAAI Working Notes of the Fall Symposium: Building Dialogue Systems for Tutorial Applications.

Russell, J.A., Bachorowski, J., Fernandez-Dols, J., 2003. Facial and vocal expressions of emotion. Ann. Rev. Psychol. 54, 29–349.

Scherer, K.R., 2003. Vocal communication of emotion: a review of research paradigms. Speech Commun. 40, 227–256.

Schultz, K., Bratt, E.O., Clark, B., Peters, S., Pon-Barry, H., Treeratpituk, P., 2003. A scalable, reusable spoken conversational tutor: SCoT. In: AIED Supplementary Proceedings, pp. 367–377.

Seipp, B., 1991. Anxiety and academic performance: a meta-analysis of findings. Anxiety Res. 4, 27–41.

Shafran, I., Riley, M., Mohri, M., 2003. Voice signatures. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 31–36.

Shah, F., Evens, M., Michael, J., Rovick, A., 2002. Classifying student initiatives and tutor responses in human–human keyboard-to-keyboard tutoring sessions. Discourse Process. 33 (1).

Siegel, S., Castellan Jr., N.J., 1988. Nonparametric Statistics for the Behavioral Sciences, second ed. McGraw-Hill, New York.

Siegle, G.J., 1994. The balanced affective word list project. Available from: <http://www.sci.sdsu.edu/CAL/wordlist/>.

ten Bosch, L., 2003. Emotions, speech and the ASR framework. Speech Commun. 40, 213–225.


VanLehn, K., Jordan, P.W., Rose, C., Bhembe, D., Bottner, M., Gaydos, A., Makatchev, M., Pappuswamy, U., Ringenberg, M., Roque, A., Siler, S., Srivastava, R., Wilson, R., 2002. The architecture of Why2-Atlas: a coach for qualitative physics essay writing. In: Proc. 6th Internat. Intell. Tutoring Syst. Conf., pp. 158–167.

Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco. Available from: <http://www.cs.waikato.ac.nz/ml/>.

Zinn, C., Moore, J.D., Core, M.G., June 2002. A 3-tier planning architecture for managing tutorial dialogue. In: Proc. Intell. Tutoring Syst. Conf. (ITS 2002), Biarritz, France, pp. 574–584.
