

How do you feel?

User emotion detection

Nikolaos Malandrakis

Signal Analysis and Interpretation Laboratory (SAIL), USC, Los Angeles, CA 90089, USA


Table of Contents

1 Introduction
  Motivation
  Definition and Representation
  Common issues
2 Modalities
  Text
  Speech Audio
  Facial features
  Body gestures
  Biometrics
  Multimodal fusion
3 Papers
  Emovoice
  Toward
  Prosody
  Multiple sources
4 Conclusions


Why should I care?

1 Association with other attributes
  satisfaction
  engagement / attention
  believability
2 Target of application
  Psychoanalysis
3 Predict user perception
  How the agent sounds
4 Publications
  You want to get published, a lot


What is an emotion?

1 Darwinian perspective
  Emotions serve an evolutionary function
  Emotion as motivation
  Instinct
2 Cognitive / Appraisal perspective
  Emotions arise through complex cognitive evaluation of the world
  Multiple "modules" represent different layers of perception
3 Neurobiology shows evidence of both
  2-pathways model
  Fast (pessimistic) instinctive response
  Slow (realistic) cognitive response


Appraisal model

[Excerpt from the cited paper: stimulus evaluation checks are processed in parallel but reach closure sequentially (relevance, implication, coping potential, normative significance), and their results drive response patterning in the other components. Figure 2: "Comprehensive illustration of the CPM of emotion" (Scherer 2001; Sander et al. 2005), showing appraisal processes, autonomic physiology, action tendencies, motor expression and subjective feeling unfolding over time.]

[1] Emotions are emergent processes, Klaus R. Scherer


2-pathways model


Representation

1 Categorical
  Nominal language labels
  Ekman's list: happiness, sadness, fear, anger, disgust, and surprise
  Mixed emotions
2 Dimensional
  Multiple uncorrelated (not really) dimensions
  Valence-Arousal-Dominance and subsets
  [Figure: valence-arousal space with example emotion labels]
3 Combinations
  Discretized forms of dimensions
  Degree of categorical emotion


The temporal aspect

1 Macro
  Different temporal frames
  Very long term (Personality)
  Long term (Mood)
  Short term (Emotion)
2 Micro
  Emotion as a temporal process
  Perceived emotion is a combination/contrast of fast and slow responses
  Example: emotional effect of expectation [2]

[Figure: timeline of an emotional event; imagination, prediction and tension before the event, appraisal and reaction after it]

[2] Sweet anticipation - Music and the psychology of expectation, David Huron


Common issues: Experienced - Expressed - Reported disjoint

[Excerpt from the cited paper: Figure 3, "Mechanism of component integration", shows appraisal, autonomic (ANS), somatic/motor (SNS) and motivation components integrated into the feeling component, which determines its quality, intensity and duration. Figure 4, "A Venn model of component integration and the role of conscious feeling", shows unconscious reflection and regulation, conscious representation and regulation, and verbalization of emotional experience as only partially overlapping circles, so verbal self-report (the "zone of valid self-report measurement") captures only part of what is consciously experienced.]

[3] Emotions are emergent processes, Klaus R. Scherer


Common issues continued

1 Data sparseness
  Emotion emergence is not that common
2 Complexity of expression
  Masking
  Acting
3 Multi-modal expression
4 Context
  Prior states, setting, topic
  Studies focus on out-of-context data
5 Stimulus-Reaction angle
  Identifying the cause of an emotional response
  Expected emotional response
  Lots of work on text, music and movie content analysis


Data collection

The most controversial part of the process.

1 Acted vs Naturalistic emotion
  Solution to data sparseness
  Clear expressions / easier analysis
  Lack of generalization
2 1st vs 3rd party annotations
  1st party can interfere with the experience
  No way to validate
  3rd party limited to expressed emotions
3 Separate modality annotation
  Annotate only one modality


Modalities

1 Emotion is expressed through different modalities
2 Typically handled separately
3 Correlation between intrusiveness and informativeness
4 Quick overview, in decreasing order of availability


Text

1 Very popular, high data availability
2 Typed vs ASR output
  Typed: non-spontaneous
  ASR: not the preferred modality / parser performance
3 Language modeling (see the sketch after this list)
  Conditional probabilities P(s|c)
  Assumed conditional independence
  Classification by the Bayesian criterion
4 Hierarchical model
  Lexicon-based
  Sentence emotion as a combination of term emotions
  Typically simple combination schemes
5 Adjectives, verbs and POS tags are very common features
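A minimal Python sketch of item 3, assuming a bag-of-words model with add-one smoothing: class priors P(c) and word likelihoods P(w|c) are estimated from labeled utterances, and an utterance is assigned the class that maximizes the Bayesian criterion. The toy corpus, labels and function names are illustrative, not any particular system's implementation.

import math
from collections import Counter, defaultdict

# Toy training data: (utterance, emotion label). Texts and labels are made up.
TRAIN = [
    ("i am so happy with this", "positive"),
    ("this is wonderful news", "positive"),
    ("i hate waiting this is awful", "negative"),
    ("i am angry and frustrated", "negative"),
]

def train_nb(data, alpha=1.0):
    """Estimate class priors P(c) and word likelihoods P(w|c) with add-alpha smoothing."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)          # word_counts[c][w]
    for text, label in data:
        word_counts[label].update(text.split())
    vocab = {w for c in word_counts for w in word_counts[c]}
    priors = {c: math.log(n / len(data)) for c, n in class_counts.items()}
    likelihoods = {
        c: {w: math.log((word_counts[c][w] + alpha) /
                        (sum(word_counts[c].values()) + alpha * len(vocab)))
            for w in vocab}
        for c in class_counts
    }
    return priors, likelihoods, vocab

def classify(text, priors, likelihoods, vocab):
    """Bayesian decision rule: argmax_c log P(c) + sum_w log P(w|c)."""
    scores = {}
    for c in priors:
        scores[c] = priors[c] + sum(likelihoods[c][w] for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

priors, likelihoods, vocab = train_nb(TRAIN)
print(classify("this is awful", priors, likelihoods, vocab))   # expected: negative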


Syntax-based composition example [4]

[Excerpt from the cited paper: sentiment polarity is composed bottom-up over the parse tree. Lexicon entries and constituents carry a polarity and a composition tag (default [=], positive [+], negative [-], or reverse [¬]); except for neutral default carriers, the superordinate (SPR) constituent determines how the polarity of the subordinate (SUB) constituent propagates to the output, and composition weights are latent in syntactic constructions such as adjectival premodification or verb-object complementation. In the example sentence "The senators supporting(+) the leader(+) failed(-) to praise(+) his hopeless(-) HIV(-) prevention program", raw counts of three positive and three negative carriers would miss the globally negative polarity, which composition recovers step by step (e.g. the reversing head noun "prevention" flips "HIV"(-) to make "HIV prevention" positive, and "failed" reverses the positive praised complement). A code sketch of this style of composition follows below.]

[4] Sentiment composition, Moilanen & Pulman
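A highly simplified Python sketch of recursive polarity composition in the spirit of the SPR/SUB scheme above. The node encoding, the tag names and the assumption that composed constituents behave as default carriers are illustrative choices, not the paper's implementation.

# Each node is either a leaf ("word", polarity, tag) or an internal node
# (spr_child, sub_child). Polarities are "+", "-", "N"; tags are "=", "+", "-", "rev".
FLIP = {"+": "-", "-": "+", "N": "N"}

def evaluate(node):
    """Return (polarity, tag) of a constituent."""
    if len(node) == 3 and all(isinstance(x, str) for x in node):
        _, polarity, tag = node                # leaf word
        return polarity, tag
    spr, sub = node
    spr_pol, spr_tag = evaluate(spr)
    sub_pol, _ = evaluate(sub)
    if spr_tag == "rev":                       # reverse the subordinate polarity
        return FLIP[sub_pol], "="
    if spr_tag in ("+", "-"):                  # strong carrier forces the output polarity
        return spr_tag, "="
    # default composition: a non-neutral SPR wins, otherwise the SUB decides
    return (spr_pol if spr_pol != "N" else sub_pol), "="

# "HIV prevention program": "prevention" is a reversing head noun over "HIV"(-)
hiv_prevention = (("prevention", "N", "rev"), ("HIV", "-", "="))
program = (hiv_prevention, ("program", "N", "="))
print(evaluate(program))    # ('+', '=') -- the phrase ends up positive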


Speech Audio [5]

1 How something is said (ASR → what is said)
2 Prosodic features and statistics
  f0 and pitch
  Energy/Power-derived features
  Duration and rate
3 Absolute vs perception-rescaled values
4 User adaptation via mean normalization or derivatives
5 Possibly gender-specific models
6 General procedure (see the sketch below)
  Extract a massive feature set
  Perform feature selection
  Classify

[5] Emotional speech recognition: Resources, features, and methods, Ververidis & Kotropoulos
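A Python sketch of the general procedure in item 6, assuming frame-level f0 and energy contours are already available from some pitch/energy tracker: compute utterance-level statistics (functionals), apply per-speaker mean normalization, select features and classify. The function names, feature list and toy data are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def utterance_features(f0, energy):
    """Utterance-level functionals over frame-level contours (voiced frames only for f0)."""
    voiced = f0[f0 > 0]
    feats = []
    for contour in (voiced, energy):
        feats += [contour.mean(), contour.std(), contour.min(), contour.max(),
                  np.ptp(contour), np.diff(contour).mean()]   # range and mean slope
    return np.array(feats)

def speaker_normalize(X, speaker_ids):
    """Subtract each speaker's mean feature vector (simple user adaptation)."""
    Xn = X.copy()
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        Xn[idx] -= X[idx].mean(axis=0)
    return Xn

# Toy example: 4 utterances, random contours standing in for real f0/energy tracks
rng = np.random.default_rng(0)
X = np.stack([utterance_features(rng.uniform(80, 300, 100), rng.random(100))
              for _ in range(4)])
y = np.array([0, 1, 0, 1])                      # e.g. neutral vs negative
speakers = np.array([0, 0, 1, 1])

Xn = speaker_normalize(X, speakers)
Xs = SelectKBest(f_classif, k=6).fit_transform(Xn, y)   # crude feature selection
clf = SVC().fit(Xs, y)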


Facial features [6]

[Excerpt from the cited paper: an audio-visual database recorded from an actress reading a phoneme-balanced corpus while expressing happiness, sadness, anger and the neutral state, with 102 facial markers tracked by a VICON motion capture system and close-talking audio. Figure 1 shows the marker layout and capture setup; Figure 2 shows the face parameterization into upper, middle and lower regions plus head motion and lip features. Pitch, RMS energy and MFCCs were extracted with Praat, and marker motion was translation- and rotation-compensated before analysis.]

1 Facial expression
  Coded through relative positions of facial features
  Facial Action Coding System (FACS): taxonomy of emotional expressions
  Forward face orientation
2 Gaze
  Fixation/Aversion indicate emotional state
  Requires knowledge of the outside world
3 Data acquisition
  Very unreliable in real-world conditions

[6] Interrelation Between Speech and Facial Gestures in Emotional Utterances, Busso & Narayanan


Body gestures [7]

[Excerpt from the cited paper: Figure 3 shows the VICON motion capture setup with 8 cameras; the marker-wearing subject sat in the middle of the room while the interlocutor sat outside the cameras' field of view about three meters away, with audio captured by shotgun microphones and semi-frontal video by two digital cameras for emotion evaluation, all synchronized with a clapboard carrying reflective markers.]

1 Posture and motion
2 Mostly focused on limb movement
3 Sample features
  Shoulder orientation
  Elbow-torso distance
  Hand shape (open, claw, fist)

[7] IEMOCAP: Interactive emotional dyadic motion capture database, Busso et al.


Biometrics

1 Galvanic Skin Response (GSR)
  skin conductance
  ↑ conductance → ↑ arousal or stress
2 Electromyography (EMG)
  frequency of muscle tension
  correlates with negative valence
3 Blood Volume Pulse (BVP)
  blood flow
  correlates with negative valence
4 Electrocardiogram (ECG)
  contractile activity of the heart
  relaxation, stress and frustration
5 Respiration rate
  breath speed and depth
  irregularity, speed → arousal


Multimodal Fusion

1 Combination of information from different modalities
2 Modalities may agree, amplify, modulate or override each other
3 Asynchronous reactions
4 Early fusion (see the sketch after this list)
  Combination of features
  Usually concatenation of feature vectors
  Very little work on more complicated methods
5 Late fusion
  Combining partial decisions
  Considered sub-optimal
  Requires per-modality annotations
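A Python sketch contrasting early fusion (feature concatenation) with late fusion (averaging per-modality posteriors), assuming two per-utterance feature matrices, e.g. audio and text. The feature dimensions, toy data and the choice of logistic regression are illustrative assumptions, not tied to any specific system.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 40
X_audio = rng.normal(size=(n, 8))      # per-utterance acoustic features
X_text = rng.normal(size=(n, 5))       # per-utterance lexical features
y = rng.integers(0, 2, size=n)         # binary emotion label

# Early fusion: concatenate feature vectors, train a single classifier
X_early = np.hstack([X_audio, X_text])
early_clf = LogisticRegression().fit(X_early, y)

# Late fusion: train one classifier per modality, then average their posteriors
audio_clf = LogisticRegression().fit(X_audio, y)
text_clf = LogisticRegression().fit(X_text, y)

def late_fusion_predict(xa, xt):
    p = (audio_clf.predict_proba(xa) + text_clf.predict_proba(xt)) / 2.0
    return p.argmax(axis=1)

print(early_clf.predict(X_early[:3]))
print(late_fusion_predict(X_audio[:3], X_text[:3]))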


EmoVoice - A framework for online recognition of emotions from voice

1 Online voice emotion detection
  Segmentation via VAD
  Brute-force feature generation: prosody and voice quality × functionals = 1302 features (see the sketch at the end of this slide)
  Optional feature selection
  Naive Bayes (and SVM) classifiers
2 Applications
  empathy: agent reflects the user's emotion
  kaleidoscope & E-tree: visualizing user emotion

[Excerpt from the cited paper: Figure 3 shows a conversation with a virtual agent that displays empathy by mirroring the user's emotional state in her face (joy, sadness, anger); a user study by Hegel et al. found a preference for the emotionally reacting robot over one without emotion recognition. Other applications are artistic, e.g. an animated kaleidoscope that changes according to the speaker's emotions, developed as an interactive-art showcase within the EU project Callas (http://www.callas-newmedia.eu), where strong, possibly exaggerated and acted emotions are expected.]
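A Python sketch of brute-force feature generation in the spirit described above: a bank of functionals is applied to every base contour and to its delta. The contour names, the set of functionals and the resulting feature count are illustrative assumptions; the actual system derives 1302 features.

import numpy as np

FUNCTIONALS = {
    "mean": np.mean, "std": np.std, "min": np.min, "max": np.max,
    "range": np.ptp, "median": np.median,
}

def brute_force_features(contours):
    """contours: dict name -> 1-D frame-level array. Returns dict feature name -> value."""
    feats = {}
    for name, x in contours.items():
        series = {name: x, f"delta_{name}": np.diff(x)}
        for sname, s in series.items():
            for fname, func in FUNCTIONALS.items():
                feats[f"{sname}_{fname}"] = float(func(s))
    return feats

rng = np.random.default_rng(2)
contours = {"f0": rng.uniform(80, 300, 200), "energy": rng.random(200)}
feats = brute_force_features(contours)
print(len(feats))   # 2 contours x 2 series x 6 functionals = 24 features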


Toward Detecting Emotions in Spoken Dialogs

1 Per-utterance emotion classification
  Naturalistic data
  Negative vs non-negative emotion
2 Acoustic features
  F0, energy, duration, formants
  Feature selection: wrapper, forward, best-first or PCA
  Separate for males and females
3 Lexical features (see the sketch after this list)
  Category-word mutual information → word selection
  Classification by sum of mutual information scores
4 Discourse features
  rejection, repeat, rephrase, ask-start over, none of the above
5 Fusion by posterior average
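A Python sketch of item 3, selecting emotionally salient words by category-word mutual information, i(w; c) = log P(c|w)/P(c), and keeping the words most indicative of the negative class. The toy corpus, the MI estimate and the selection threshold are illustrative assumptions.

import math
from collections import Counter

CORPUS = [
    ("no no that is wrong", "negative"),
    ("this is not what i asked", "negative"),
    ("yes that works thanks", "non-negative"),
    ("book a flight to boston", "non-negative"),
]

def word_class_mi(corpus):
    """Pointwise mutual information i(w; c) = log P(c|w) / P(c) for each word/class pair."""
    n = len(corpus)
    class_counts = Counter(c for _, c in corpus)
    word_counts = Counter()
    joint = Counter()
    for text, c in corpus:
        for w in set(text.split()):
            word_counts[w] += 1
            joint[(w, c)] += 1
    mi = {}
    for (w, c), nwc in joint.items():
        p_c_given_w = nwc / word_counts[w]
        p_c = class_counts[c] / n
        mi[(w, c)] = math.log(p_c_given_w / p_c)
    return mi

mi = word_class_mi(CORPUS)
# keep the words most indicative of the negative class
salient = sorted((w for (w, c) in mi if c == "negative" and mi[(w, c)] > 0),
                 key=lambda w: -mi[(w, "negative")])
print(salient)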


Results (Error rate, Male)

[Excerpt from the cited paper: Table V reports classification error and its standard deviation for the male data with different acoustic feature sets (base 21-dim, 10-best, 15-best, PCA) combined with language and discourse information; the baseline classifies all data as non-negative and the best result per case is bold-faced. Figures 3 and 4 compare combinations of acoustic (Ac), language (Lan) and discourse (Dis) features for male and female data, using LDC and k-NN classifiers for the acoustic correlates.]


Results (Error rate, Female)

[Excerpt from the cited paper: Table VI reports classification error and its standard deviation for the female data with the same feature sets; Table VII gives pairwise Q-statistics between the acoustic, lexical and discourse classifiers. For classifiers D_i and D_k, Q_{i,k} = (N^11 N^00 - N^01 N^10) / (N^11 N^00 + N^01 N^10), where N^11 and N^00 count utterances both classifiers get right or wrong and N^10, N^01 count the disagreements; Q is 0 for statistically independent classifiers and approaches +/-1 as they become more dependent. The Q-statistic between the lexical and discourse classifiers is almost 1, which explains why adding discourse information to acoustic and lexical information did not improve performance. The discussion notes low human inter-labeler agreement (kappa) and lexical data sparsity as open issues, while arguing that domain-specific recognizers such as this call-center application can still perform reasonably well.]


Prosody-based automatic detection of annoyance and frustration in H-C dialog

1 Per-utterance frustration and annoyance detection
2 Naturalistic data
3 Original agreement vs consensus labels
4 Style features: hyperarticulation, pausing, raised voice
5 Repetition features: no repeat, repeat, repeat with correction
6 Data quality: joking, system cutoff
7 Audio features: standard
8 Linguistic features: sign of the LM likelihood difference
9 Decision tree & wrapper, best-first, forward selection (see the sketch below)
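A Python sketch of item 9, assuming wrapper-style greedy forward selection around a decision tree scored by cross-validated accuracy; the data, scoring and stopping rule are illustrative assumptions.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_select(X, y, max_features=5, cv=3):
    """Greedily add the feature that most improves cross-validated accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                                  X[:, cols], y, cv=cv).mean()
            scores.append((acc, f))
        acc, f = max(scores)
        if acc <= best_score:           # stop when no candidate helps
            break
        best_score = acc
        selected.append(f)
        remaining.remove(f)
    return selected, best_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
y = (X[:, 2] + 0.5 * X[:, 7] > 0).astype(int)   # only features 2 and 7 matter
print(forward_select(X, y))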


Results (Accuracy, Efficiency)

[Excerpt from the cited paper: Table 3 summarizes accuracy ("Acc") and efficiency ("Eff") for the ANNOY.+FRUST. vs. ELSE and FRUST. vs. ELSE tasks, on true vs ASR words, for consensus labels and for the originally agreed subset, with feature groups ablated (STYLE = speaking style, REP = repeat/correction, LM = language model features); accuracies simulate equal class distributions in the test set through sample weighting, and LM features were computed for the first task only. In the discussion, longer durations, slower speaking rate, high pitch and the repeat/correction feature were associated with frustration; among the speaking-style features only raised voice was a helpful predictor, while hyperarticulation and pausing were not.]

No sample distributions are provided


Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources I

1 Per-turn neg-neu-pos classification
2 Naturalistic data, tutoring system, WoZ
  All questions labeled as negative
  Grounding labeled as neutral
  [90, 280, 15] samples
3 Standard audio features
4 Lexical features: lexical item frequency vector
5 Automatic features: turn start & end time, barge-in, overlap, # of words and syllables in turn
6 Manual features: # false starts in turn, isPriorTutorQuestion, isQuestion, isSemanticBarge-in, # canonical expressions in turn, isGrounding
7 Identifier features: subject, gender, problem


Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources II

8 Context features (see the sketch below)
  Local: values across the 2 previous turns
  Global: average & sum across all previous turns
9 AdaBoost & decision tree
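A Python sketch of item 8, assuming per-turn feature vectors in dialogue order: each turn is augmented with the features of the two previous turns (local context) and with the running average and sum over all previous turns (global context). Shapes and toy data are illustrative assumptions.

import numpy as np

def add_context_features(turn_feats, n_local=2):
    """turn_feats: (n_turns, n_feats) array in dialogue order."""
    n_turns, n_feats = turn_feats.shape
    rows = []
    for t in range(n_turns):
        local = []
        for k in range(1, n_local + 1):          # features of the k-th previous turn
            prev = turn_feats[t - k] if t - k >= 0 else np.zeros(n_feats)
            local.append(prev)
        history = turn_feats[:t]
        glob_avg = history.mean(axis=0) if t > 0 else np.zeros(n_feats)
        glob_sum = history.sum(axis=0) if t > 0 else np.zeros(n_feats)
        rows.append(np.concatenate([turn_feats[t], *local, glob_avg, glob_sum]))
    return np.stack(rows)

turns = np.arange(12, dtype=float).reshape(4, 3)    # 4 turns, 3 features each
print(add_context_features(turns).shape)            # (4, 3 * (1 + 2 + 2)) = (4, 15)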


Results: Sound vs Text

standard error (SE)6 of AdaBoost on the 8 feature setsfrom Figure 3, computed across 10 runs of 10-fold cross-validation.7 Although not shown in this and later ta-bles, all of the feature sets examined in this paper pre-dict emotion significantly better than a standard majorityclass baseline algorithm (always predict “neutral”, whichyields an accuracy of 72.74%). For Table 1, AdaBoost’simprovement for each feature set, relative to this baselineerror of 27.26%, averages 24.40%, and ranges between12.69% (“speech-ident”) and 43.87% (“alltext+ident”).8

Feature Set   -ident   SE     +ident   SE
speech        76.20    0.55   77.41    0.52
lexical       78.31    0.44   79.55    0.27
autotext      80.38    0.43   81.19    0.35
alltext       83.19    0.30   84.70    0.20

Table 1: %Correct on Speech vs. Text (cross-val.)

As shown in Table 1, the best accuracy of 84.70% is achieved on the "alltext+ident" feature set. This accuracy is significantly better than the accuracy of the seven other feature sets [9], although the difference between the "+/-ident" versions was not significant for any other pair besides "alltext". In addition, the results of five of the six text-based feature sets are significantly better than the results of both acoustic-prosodic feature sets ("speech +/- ident"). Only the text-only feature set ("lexical-ident") did not perform statistically better than "speech+ident" (although it did perform statistically better than "speech-ident"). These results show that while acoustic-prosodic features can be used to predict emotion significantly better than a majority class baseline, using only non-acoustic-prosodic features consistently produces even significantly better results. Furthermore, the more text-based features the better, i.e., supplementing lexical items with additional features consistently yields further accuracy increases. While adding in the subject- and problem-specific "+ident" features improves the accuracy of all the "-ident" feature sets, the improvement is only significant for the highest-performing set ("alltext").

[6] We compute the SE from the standard deviation (std(x)/sqrt(n), where n = 10 runs), which is automatically computed in Weka.
[7] For each cross-validation, the training and test data are drawn from turns produced by the same set of speakers. We also ran cross-validations training on n-1 subjects and testing on the remaining subject, but found our results to be the same.
[8] Relative improvement over the baseline error for feature set x = (baseline error - error(x)) / baseline error, where error(x) is 100 minus the %correct(x) value shown in Table 1.
[9] For any feature set, the mean +/- 2*SE gives the 95% confidence interval. If the confidence intervals for two feature sets are non-overlapping, then their mean accuracies are significantly different with 95% confidence.
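A small sketch of the arithmetic in footnotes [6], [8], and [9], checked against the Table 1 numbers (the relative-improvement formula and interval rule are reconstructed from the text, not taken from any released code; the sample standard deviation is an assumption):

# Reconstruction of the reported statistics, not the authors' code.
from math import sqrt

BASELINE_ACC = 72.74                 # always-"neutral" majority baseline (%)
BASELINE_ERR = 100 - BASELINE_ACC    # 27.26%

def standard_error(accuracies):
    """SE over n runs of cross-validation: std(x) / sqrt(n)  (footnote [6])."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    std = sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))  # sample std assumed
    return std / sqrt(n)

def relative_improvement(acc):
    """Relative error reduction over the baseline (footnote [8])."""
    err = 100 - acc
    return 100 * (BASELINE_ERR - err) / BASELINE_ERR

def significantly_different(mean_a, se_a, mean_b, se_b):
    """Non-overlapping mean +/- 2*SE intervals => significant at ~95% (footnote [9])."""
    lo_a, hi_a = mean_a - 2 * se_a, mean_a + 2 * se_a
    lo_b, hi_b = mean_b - 2 * se_b, mean_b + 2 * se_b
    return hi_a < lo_b or hi_b < lo_a

print(round(relative_improvement(76.20), 2))              # speech-ident  -> 12.69
print(round(relative_improvement(84.70), 2))              # alltext+ident -> 43.87
print(significantly_different(84.70, 0.20, 76.20, 0.55))  # True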

The next question we addressed concerns whether combinations of acoustic-prosodic and other types of features can further improve AdaBoost's predictive accuracy. We investigated AdaBoost's performance on the set of 6 feature sets formed by combining the "speech" acoustic-prosodic set with each text-based set, both with and without identifier features, as shown in Table 2.

Feature Set        -ident   SE     +ident   SE
lexical+speech     79.26    0.46   79.09    0.36
autotext+speech    79.64    0.47   79.36    0.48
alltext+speech     83.69    0.36   84.26    0.26

Table 2: %Correct on Speech+Text (cross-val.)

AdaBoost's best accuracy of 84.26% is achieved on the "alltext+speech+ident" combined feature set. This result is significantly better than the %correct achieved on the four "autotext" and "lexical" combined feature sets, but is not significantly better than the "alltext+speech-ident" feature set. Furthermore, there was no significant difference between the results of the "autotext" and "lexical" combined feature sets, nor between the "-ident" and "+ident" versions for the 6 combined feature sets.

Comparing the results of these combined (speech+text) feature sets with the speech versus text results in Table 1, we find that for autotext+speech-ident and all +ident feature sets, the combined feature set slightly decreases predictive accuracy when compared to the corresponding text-only feature set. However, there is no significant difference between the best results in each table (alltext+speech+ident vs. alltext+ident).

Emotion Class   Precision   Recall   F-Measure
negative        0.71        0.60     0.65
neutral         0.86        0.92     0.89
positive        0.50        0.27     0.35

Table 3: Other Metrics on "alltext+speech+ident" (LOO)
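The per-class metrics in Table 3 follow the standard confusion-matrix definitions (discussed further below); a minimal sketch, with the class labels reused from the paper but the predictions you would feed in being your own:

# Per-class precision, recall, and F-measure from true/predicted label lists
# (standard definitions; not tied to the paper's data).
from collections import Counter

def per_class_metrics(y_true, y_pred, labels=("negative", "neutral", "positive")):
    pairs = Counter(zip(y_true, y_pred))
    metrics = {}
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(pairs[(t, c)] for t in labels if t != c)   # predicted c, truly other
        fn = sum(pairs[(c, p)] for p in labels if p != c)   # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics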

In addition to accuracy, other important evaluation metrics include recall, precision, and F-Measure (F = 2 * Precision * Recall / (Precision + Recall)). Table 3 shows AdaBoost's performance with respect to these metrics across emotion classes for the "alltext+speech+ident" feature set, using leave-one-out cross validation (LOO). AdaBoost accuracy here is 82.08%. As shown, AdaBoost yields the best performance for the neutral (majority) class, and has better performance for negatives than for positives. We also found positives to be the most difficult emotion to annotate. Overall, however, AdaBoost performs significantly better than the baseline, whose precision, recall and F-measure for negatives and positives is 0, and for neutrals is 0.727, 1, and 0.842, respectively.

6 Adding Context-Level Features

Research in other domains (Litman et al., 2001; Batliner et al., 2003) has shown that features representing the dialogue context can sometimes improve the accuracy of predicting negative user states, compared to the use of features computed from only the turn to be predicted. Thus, we investigated the impact of supplementing our turn-level features in Figure 2 with the features in Figure 4, representing local and global [10] aspects of the prior dialogue, respectively.

Local Features: feature values for the two student turns preceding the student turn to be predicted

Global Features: running averages and totals for each feature, over all student turns preceding the turn to be predicted

Figure 4: Contextual Features for Machine Learning
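A hedged sketch of the Figure 4 idea: each turn's feature dictionary is augmented with the raw values of the two preceding turns (local) and with running averages and totals over all preceding turns (global). Footnote [10]'s restriction of totals to interpretable numeric features is ignored here, and the feature naming is mine:

# Reconstruction of the contextual-feature idea, not the authors' code.
from typing import Dict, List

def add_context_features(turn_feats: List[Dict[str, float]]) -> List[Dict[str, float]]:
    out = []
    running_sum: Dict[str, float] = {}
    for i, feats in enumerate(turn_feats):
        ctx = dict(feats)
        # local context: raw feature values of the two preceding turns
        for back in (1, 2):
            if i - back >= 0:
                for k, v in turn_feats[i - back].items():
                    ctx[f"local{back}_{k}"] = v
        # global context: running average and total over all preceding turns
        if i > 0:
            for k, total in running_sum.items():
                ctx[f"glob_avg_{k}"] = total / i
                ctx[f"glob_sum_{k}"] = total
            ctx["n_turns_so_far"] = float(i)
        # update running totals with the current turn before moving on
        for k, v in feats.items():
            running_sum[k] = running_sum.get(k, 0.0) + v
        out.append(ctx)
    return out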

We next performed machine learning experiments using our two original speech-based feature sets ("speech +/- ident"), and four of our text-based feature sets ("autotext" and "alltext" +/- ident), each separately supplemented with local, global, and local+global features. Table 4 presents the results of these experiments.

Feature Set          -ident   SE     +ident   SE
speech+loc           76.90    0.45   76.95    0.40
speech+glob          77.77    0.52   78.02    0.33
speech+loc+glob      77.00    0.46   76.88    0.47
autotext+loc         78.06    0.33   78.24    0.45
autotext+glob        79.35    0.18   80.39    0.43
autotext+loc+glob    77.67    0.54   77.74    0.48
alltext+loc          80.33    0.46   80.99    0.40
alltext+glob         83.85    0.37   83.74    0.55
alltext+loc+glob     81.02    0.35   81.23    0.58

Table 4: %Correct, Speech vs. Text, +context (cross-val.)

AdaBoost's best accuracy of 83.85% is achieved on the "alltext+glob-ident" combined feature set. This result is not significantly better than the %correct achieved on its "+ident" counterpart, but both of these results are significantly better than the %correct achieved on all other 16 feature sets. Moreover, all of the results for both the "alltext" and "autotext" feature sets were significantly better than the results for all of the "speech" feature sets. Although the "alltext+loc" feature sets were not significantly better than the best autotext feature sets (autotext+glob), they were better than the remaining "autotext" feature sets, and the "alltext+loc+glob" feature sets were better than all of the autotext feature sets. For all feature sets, the difference between the "-ident" and "+ident" versions was not significant. In sum, we see again that the more text-based features the better: adding text-based features again consistently improves results significantly. We also see that global features perform better than local features, and while global+local perform better than local features, global features alone consistently yield the best performance.

[10] Running totals are only computed for numeric features if the result is interpretable, e.g., for turn duration, but not for tempo. Running averages for text-based features additionally include a "# turns so far" feature and a "# essays so far" feature.

Comparing these results with the results in Tables 1 and 2, we find that while overall the performance of contextual non-combined feature sets shows a small performance increase over most non-contextual combined or non-combined feature sets, there is again a slight decrease in performance across the best results in each table. However, there is no significant difference between these best results (alltext+glob-ident vs. alltext+speech+ident vs. alltext+ident).

Table 5 shows the results of combining speech-based and text-based contextual feature sets. We investigated AdaBoost's performance on the 12 feature sets formed by combining the "speech" acoustic-prosodic set with our "autotext" and "alltext" text-based feature sets, both with and without identifier features, and each separately supplemented with local, global, and local+global features.

Feature Set           -ident   SE     +ident   SE
auto+speech+lo        78.23    0.39   77.30    0.52
auto+speech+gl        79.33    0.22   78.84    0.39
auto+speech+lo+gl     78.26    0.20   78.01    0.43
all+speech+lo         82.44    0.31   82.15    0.56
all+speech+gl         84.75    0.32   84.35    0.20
all+speech+lo+gl      81.43    0.28   81.04    0.43

Table 5: %Correct on Text+Speech+Context (cross-val.)

AdaBoost's best accuracy of 84.75% is achieved on the "alltext+speech+glob-ident" combined feature set. This result is not significantly better than the %correct achieved on its "+ident" counterpart, but both results are significantly better than the %correct achieved on all 10 other feature sets. In fact, all the "alltext" results are significantly better than all the "autotext" results. Again for all feature sets, the difference between the "-ident" and "+ident" versions was not significant. In sum, adding text-based features again consistently improves results significantly, and global features alone consistently yield the best performance. Although the best result across all experiments is that of "alltext+speech+glob-ident", there is no significant difference between the best results here and those in our three other experimental conditions.

A summary figure of our best results for text (alltext) and speech alone, then combined with each other and with our best result for context (global), is shown in Figure 5, for the "+/- ident" conditions; baseline performance is also shown. As shown, the accuracy of the "-ident" condition monotonically increases as features [...]


Conclusions

1 Still very much an open problem
2 Applications limited by
    Questionable performance and generalization
    Lack of semantic understanding
    Stimulus-reaction link
3 Limited multi-modal work
4 Lack of annotated data seems to dictate research directions
