Acoustic Cues to Emotional Speech
Julia Hirschberg
(joint work with Jennifer Venditti and Jackson Liscombe)
Columbia University
26 June 2003
Motivation
• A speaker’s emotional state conveys important and potentially useful information
  – To recognize (e.g. spoken dialogue systems, tutoring systems)
  – To generate (e.g. games)
  – If we know what emotion is and what aspects of productions convey different types
• Defining emotion in multidimensional space
  – Valence: happy vs. sad
  – Activation: sad vs. despairing
• Features that might convey emotion
  – Acoustic and prosodic
  – Lexical and syntactic
  – Facial and gestural
Previous Research
• Emotion detection in corpus studies
  – Batliner, Noeth, et al.; Ang et al.: anger/frustration in dialogue systems
  – Lee et al.: pos/neg emotion in call center data
  – Ringel & Hirschberg: voicemail
• … in laboratory studies
  – Forced choice among 10-12 emotion categories
  – Sometimes with confidence rating
Problems
• Hard to identify emotions reliably
  – Variation in ‘emotional’ utterances: production and perception
  – How can we obtain better training data?
• Easier to detect variation in activation than in valence
  – Variation in ‘emotional’ utterances
  – Large space of potential features
  – Which are necessary and sufficient?
New methods for eliciting judgments
• Hypothesis: Utterances in natural speech may evoke multiple emotions
  – Elicit judgments on multiple scales
  – Tokens from LDC Emotional Prosody Speech and Transcripts Corpus
    • Professional actors reading 4-syllable dates and numbers
    • disgust, panic, anxiety, hot anger, cold anger, despair, sadness, elation, happiness, interest, boredom, shame, pride, contempt, neutrality
• Modified category set:
  – Positive: confident, encouraging, friendly, happy, interested
  – Negative: angry, anxious, bored, frustrated, sad
  – Neutral
• For study: 1 token of each from each of 4 voices plus practice tokens
• Subjects participated over the internet
  – 40 native speakers of standard American English with no reported hearing impairment
  – 17 female, 23 male, all 18+
  – 4 random orders rotated among subjects
Correlations between Judgments
               sad   ang   bor   fru   anx   fri   con   hap   int   enc
sad                  .06   .44   .26   .22  -.27  -.32  -.42  -.32  -.33
angry                      .05   .70   .21  -.41   .02   .37  -.09  -.32
bored                            .14  -.14  -.28  -.17  -.32  -.42  -.27
frustrated                             .32  -.43  -.09  -.47  -.16  -.39
anxious                                     -.14  -.25  -.17   .07  -.14
friendly                                           .44   .77   .59   .75
confident                                                .45   .51   .53
happy                                                          .58   .73
interested                                                           .62
encouraging
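A pairwise table like the one above can be sketched as follows. This is only an illustration: the slides do not say which correlation coefficient was used, so Pearson's r is an assumption here, and the `toy_ratings` values and the 1-5 rating scale are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(ratings):
    """Map each unordered pair of scales to its correlation (upper triangle only)."""
    scales = list(ratings)
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for i, a in enumerate(scales)
        for b in scales[i + 1:]
    }


# Hypothetical data: one rating per (subject, token) judgment on each scale.
toy_ratings = {
    "happy":    [5, 4, 1, 2, 4, 1],
    "friendly": [4, 5, 1, 1, 5, 2],
    "sad":      [1, 1, 4, 5, 2, 4],
}

matrix = correlation_matrix(toy_ratings)
for (a, b), r in matrix.items():
    print(f"{a:>8} x {b:<8} {r:+.2f}")
```

Only the upper triangle is computed, matching the layout of the table, since the correlation of a scale with another is symmetric.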
What acoustic features correlate with which emotion categories?
– F0: min, max, mean, ‘range’, stdev
– RMS: min, max, mean, range, stdev
– Voiced samples/all samples (VCD)
– Mean syllable length
– TILT: spectral tilt (2-1 harmonic over a 30 ms window) of the highest-amplitude vowel and of the nuclear stressed vowel
– Type of nuclear accent, contour, phrasal ending
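The summary statistics in this list could be computed along the following lines. This is a rough sketch under several assumptions that are not in the slides: the pitch track is a list of per-frame F0 values in Hz with 0.0 marking unvoiced frames, ‘range’ is taken to be max minus min, the standard deviation is the population form, and VCD is approximated by the ratio of voiced frames to all frames. Spectral tilt, syllable length, and the accent features are left out. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize(values):
    """Return min, max, mean, range (max - min), and population stdev."""
    lo, hi = min(values), max(values)
    return {
        "min": lo,
        "max": hi,
        "mean": mean(values),
        "range": hi - lo,
        "stdev": pstdev(values),
    }


def frame_rms(samples, frame_len):
    """RMS energy of consecutive non-overlapping frames (partial last frame dropped)."""
    if frame_len <= 0 or len(samples) < frame_len:
        raise ValueError("need at least one full frame")
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in starts
    ]


def token_features(f0_track, samples, frame_len):
    """Assumed inputs: f0_track in Hz (0.0 = unvoiced frame), samples as floats."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in pitch track")
    feats = {f"F0_{k}": v for k, v in summarize(voiced).items()}
    rms = summarize(frame_rms(samples, frame_len))
    feats.update({f"RMS_{k}": v for k, v in rms.items()})
    # Approximation of VCD: voiced frames over all frames.
    feats["VCD"] = len(voiced) / len(f0_track)
    return feats


# Hypothetical toy token: five pitch frames and eight waveform samples.
toy_f0 = [0.0, 110.0, 120.0, 130.0, 0.0]
toy_samples = [0.0, 1.0, 0.0, -1.0, 3.0, 4.0, 3.0, 4.0]
features = token_features(toy_f0, toy_samples, frame_len=4)
for name, value in features.items():
    print(f"{name:10s} {value:.3f}")
```

In practice the pitch track and waveform would come from a speech analysis tool; the toy lists only stand in for that output.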
Results
• F0, RMS and rate distinguish emotion categories by activation (act)
  – +act correlates with higher F0 and RMS and a faster rate
  – These features do not distinguish valence (val)
• Tilt of the highest-amplitude vowel groups +act emotions with different val into different categories (e.g. friendly, happy, encouraging vs. angry, frustrated)
• Phrase accent/boundary tone also separates +val from -val
  – H-L% positively correlated with -val and negatively with +val
  – L-L% positively correlated with +val but not with -val
Predicting Emotion Categories Automatically
• 1760 judgment/token datapoints (90%/10% training/test)
  – Collapse 2-5 ratings to one
• Ripper machine learning algorithm
  – Baseline: choose most frequent ranking
  – Mean performance over all emotions: 75% (22% improvement over baseline)
  – Individual emotion categories
    – Happy, encouraging, sad, and anxious predicted well
    – Confident and interested show little improvement
  – Which features best predict which emotion categories?
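The data preparation and the baseline can be sketched as below; Ripper itself is not implemented. The assumptions are labeled in the code: "collapse 2-5 ratings to one" is read as merging ratings 2 through 5 into a single "present" class with 1 as "absent", the split is a random shuffle, and the ratings are randomly generated placeholders rather than study data. The count of 1760 is consistent with 11 categories x 4 voices x 40 subjects, although the slides do not spell out that product.

```python
import random
from collections import Counter


def binarize(rating):
    """Assumed reading: ratings 2-5 become 'present', rating 1 is 'absent'."""
    return "present" if rating >= 2 else "absent"


def split_90_10(items, seed=0):
    """Shuffle a copy of the items and split them 90% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.9 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)


# 11 categories (5 positive, 5 negative, neutral) x 4 voices x 40 subjects.
n_datapoints = 11 * 4 * 40

# Placeholder ratings on a hypothetical 1-5 scale, NOT data from the study.
rng = random.Random(1)
ratings = [rng.choice([1, 1, 1, 2, 3, 4, 5]) for _ in range(n_datapoints)]
labels = [binarize(r) for r in ratings]

train, test = split_90_10(labels)
majority, baseline_acc = majority_baseline(train, test)
print(len(train), len(test), majority, f"{baseline_acc:.3f}")
```

A learned rule set would then be compared against `baseline_acc` on the same held-out 10%, once per emotion scale.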
Best Performing Features
Emotion       Feature                  Accuracy
Angry         F0*, RMS*, TILT*, VCD    77.3/69.3%
Confident     F0_range, F0_mean        76.1/75.0%
Happy         F0_min                   81.3/57.4%
Interested    F0_stdev                 75.6/69.9%
Encouraging   VCD                      73.9/52.3%
Sad           F0_max                   81.3/61.9%
Anxious       Tilt_RMS                 78.4/55.7%
Bored         Tilt_RMS                 80.1/66.5%
Friendly      Tilt_stress              75.0/59.1%
Frustrated    F0_max                   75.0/59.1%
Conclusions
• New features to distinguish valence: spectral tilt and prosodic endings
• New understanding of relations among emotion categories
  – Judgments
  – Features
Current/Future Work
• Use ML to rank rather than classify (RankBoost)
• Eye-tracking task, matching tokens to ‘emotional’ pictures
  – Web survey to ‘norm’ pictures
  – Layout issues