Acoustic Cues to Emotional Speech

Julia Hirschberg

(joint work with Jennifer Venditti and Jackson Liscombe)

Columbia University

26 June 2003

Motivation

• A speaker’s emotional state conveys important and potentially useful information
– To recognize (e.g. spoken dialogue systems, tutoring systems)
– To generate (e.g. games)
– If we know what emotion is and what aspects of productions convey different types
• Defining emotion in multidimensional space
– Valence: happy vs. sad
– Activation: sad vs. despairing
• Features that might convey emotion
– Acoustic and prosodic
– Lexical and syntactic
– Facial and gestural

Previous Research

• Emotion detection in corpus studies
– Batliner, Noeth, et al.; Ang et al.: anger/frustration in dialogue systems
– Lee et al.: pos/neg emotion in call center data
– Ringel & Hirschberg: voicemail
• … in laboratory studies
– Forced choice among 10-12 emotion categories
– Sometimes with confidence rating

Problems

• Hard to identify emotions reliably
– Variation in ‘emotional’ utterances: production and perception
– How can we obtain better training data?
• Easier to detect variation in activation than in valence
– Variation in ‘emotional’ utterances
– Large space of potential features
– Which are necessary and sufficient?

New methods for eliciting judgments

• Hypothesis: Utterances in natural speech may evoke multiple emotions
– Elicit judgments on multiple scales
– Tokens from LDC Emotional Prosody Speech and Transcripts Corpus
• Professional actors reading 4-syllable dates and numbers
• disgust, panic, anxiety, hot anger, cold anger, despair, sadness, elation, happiness, interest, boredom, shame, pride, contempt, neutrality

• Modified category set:
– Positive: confident, encouraging, friendly, happy, interested
– Negative: angry, anxious, bored, frustrated, sad
– Neutral

• For study: 1 token of each from each of 4 voices plus practice tokens

• Subjects participated over the internet

– 40 native speakers of standard American English with no reported hearing impairment

– 17 female, 23 male, all 18+
– 4 random orders rotated among subjects

Correlations between Judgments

               sad   ang   bor   fru   anx   fri   con   hap   int   enc
sad                  .06   .44   .26   .22  -.27  -.32  -.42  -.32  -.33
angry                      .05   .70   .21  -.41   .02   .37  -.09  -.32
bored                            .14  -.14  -.28  -.17  -.32  -.42  -.27
frustrated                             .32  -.43  -.09  -.47  -.16  -.39
anxious                                     -.14  -.25  -.17   .07  -.14
friendly                                           .44   .77   .59   .75
confident                                                .45   .51   .53
happy                                                          .58   .73
interested                                                           .62
encouraging
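A table like the one above can be produced by correlating the ratings on each pair of scales across all judgments. The following is a minimal sketch, assuming Pearson correlation (the slides do not say which coefficient was used) and assuming each scale's ratings are stored as a list in the same judgment order. The function names and the toy ratings are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def upper_triangle(ratings):
    """Return {(scale_a, scale_b): r} for every unordered pair of scales.

    `ratings` maps a scale name to its list of ratings, one per judgment,
    listed in the same order for every scale.
    """
    names = list(ratings)
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy ratings invented for illustration only (NOT data from the study).
toy = {
    "happy":    [5, 4, 1, 1, 3, 2],
    "friendly": [4, 5, 1, 2, 3, 2],
    "sad":      [1, 1, 5, 4, 2, 3],
}

for (a, b), r in upper_triangle(toy).items():
    print(f"{a:>8} x {b:<8} {r:+.2f}")
```

Only the upper triangle is computed because the correlation of a pair does not depend on the order of the two scales.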

What acoustic features correlate with which emotion categories?

– F0: min, max, mean, ‘range’, stdev
– RMS: min, max, mean, range, stdev
– Voiced samples/all samples (VCD)
– Mean syllable length
– TILT: spectral tilt (2-1 harmonic over 30 ms window) of highest-amplitude vowel, nuclear stressed vowel
– Type of nuclear accent, contour, phrasal ending
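The F0, RMS and VCD summaries listed above can be sketched as follows. This is an illustration under stated assumptions, not the extraction pipeline used in the study: the F0 track is assumed to be one value per frame with 0 marking unvoiced frames, ‘range’ is assumed to mean max minus min, the standard deviation is the population form, and VCD is approximated as voiced frames over all frames rather than voiced samples over all samples. All function names and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize(values):
    """min, max, mean, range and standard deviation of a non-empty sequence."""
    if not values:
        raise ValueError("cannot summarize an empty sequence")
    lo, hi = min(values), max(values)
    return {
        "min": lo,
        "max": hi,
        "mean": mean(values),
        "range": hi - lo,         # assumption: range = max - min
        "stdev": pstdev(values),  # assumption: population standard deviation
    }


def frame_rms(samples, frame_len):
    """RMS energy of each consecutive, non-overlapping, full-length frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [
        sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def utterance_features(f0_track, samples, frame_len):
    """Collect F0_*, RMS_* and VCD features for one utterance."""
    voiced = [f for f in f0_track if f > 0]  # 0 marks an unvoiced frame
    if not voiced:
        raise ValueError("utterance has no voiced frames")
    feats = {f"F0_{k}": v for k, v in summarize(voiced).items()}
    rms = summarize(frame_rms(samples, frame_len))
    feats.update({f"RMS_{k}": v for k, v in rms.items()})
    feats["VCD"] = len(voiced) / len(f0_track)  # frame-level proxy for voicing ratio
    return feats


# Toy values invented for illustration only.
feats = utterance_features(
    f0_track=[0, 180, 200, 220, 0],
    samples=[0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.0, 0.0],
    frame_len=4,
)
print(feats["F0_mean"], feats["F0_range"], feats["VCD"])
```

Spectral tilt, syllable length and the intonational labels are left out because they need a spectrum estimate, a syllable segmentation and hand or automatic prosodic annotation, none of which the slides specify.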

Results

• F0, RMS and rate distinguish emotion categories by activation (act)
– +act correlates with higher F0 and RMS, faster rate
– do not distinguish valence (val)

• Tilt of highest amplitude vowel groups +act emotions with different val into different categories (e.g. friendly, happy, encouraging vs. angry, frustrated)

• Phrase accent/boundary tone also separates +val from -val

– H-L% positively correlated with -val and negatively with +val

– +val positively correlated with L-L% and -val not

Predicting Emotion Categories Automatically

• 1760 judgment/token datapoints (90%/10% training/test)
– collapse 2-5 ratings to one
• Ripper machine learning algorithm
– Baseline: choose most frequent ranking
– Mean performance over all emotions 75% (22% improvement over baseline)
– Individual emotion categories
– Happy, encouraging, sad, and anxious predicted well
– Confident and interested show little improvement
– Which features best predict which emotion categories?
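The evaluation setup above (collapsed ratings, a 90%/10% split, a most-frequent-class baseline) can be sketched as below. The 1760 datapoints are consistent with 40 subjects each rating 44 tokens (11 categories from 4 voices), although the slides do not spell this out. The sketch assumes a 1-5 rating scale on which 1 means the emotion was not perceived, so that "collapse 2-5 ratings to one" yields a two-class problem. It does not implement Ripper, only the baseline a learned rule set would be compared against. All names and the toy ratings are hypothetical.

```python
import random
from collections import Counter


def binarize(rating):
    """Collapse ratings 2-5 into 'present'; a rating of 1 stays 'absent'.

    Assumes a 1-5 scale where 1 means the emotion was not perceived.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating out of range: {rating!r}")
    return "present" if rating >= 2 else "absent"


def split_90_10(items, seed=0):
    """Shuffle reproducibly and return (train, test) in a 90%/10% split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.9 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


# Toy ratings for one emotion scale, invented for illustration only.
toy_ratings = [1] * 60 + [2] * 15 + [3] * 10 + [4] * 10 + [5] * 5
labels = [binarize(r) for r in toy_ratings]
train, test = split_90_10(labels)
print(len(train), len(test), majority_baseline_accuracy(train, test))
```

A classifier for a given emotion is only informative to the extent that its test accuracy exceeds this baseline, which is why the slide reports improvement over baseline rather than raw accuracy alone.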

Best Performing Features

Emotion      Feature                  Accuracy
Angry        F0*, RMS*, TILT*, VCD    77.3/69.3%
Confident    F0_range, F0_mean        76.1/75.0%
Happy        F0_min                   81.3/57.4%
Interested   F0_stdev                 75.6/69.9%
Encouraging  VCD                      73.9/52.3%
Sad          F0_max                   81.3/61.9%
Anxious      Tilt_RMS                 78.4/55.7%
Bored        Tilt_RMS                 80.1/66.5%
Friendly     Tilt_stress              75.0/59.1%
Frustrated   F0_max                   75.0/59.1%

Conclusions

• New features to distinguish valence: spectral tilt and prosodic endings

• New understanding of relations among emotion categories
– Judgments
– Features

Current/Future Work

• Use ML to rank rather than classify (RankBoost)
• Eye-tracking task, matching tokens to ‘emotional’ pictures
– Web survey to ‘norm’ pictures
– Layout issues
