

Reza Lotfian and Carlos Busso Multimodal Signal Processing (MSP) Laboratory

University of Texas at Dallas

Motivation

- Lexical-dependent models improve emotion recognition accuracy
- Practical approaches can model only small lexical units: phonemes, syllables, or a few keywords

- Can we leverage a text-to-speech (TTS) system?
- IDEA: use synthetic speech as a neutral reference model
  - Contrast different acoustic features
  - Unveil local emotional changes

Proposed Approach

Feature set                                | Arousal [%] | Valence [%]
-------------------------------------------|-------------|------------
Baseline HLD                               | 63.8        | 59.9
Duration HLD                               | 57.7        | 56.8
Relative HLD                               | 63.3        | 58.6
Relative HLD & Duration HLD                | 63.5        | 60.3
Baseline HLD & Relative HLD                | 64.1        | 61.6
Baseline HLD & Relative HLD & Duration HLD | 66.0        | 62.7

- Underlying assumptions:
  - Synthetic speech is a good representative of neutral speech
  - Lexical content (transcript) is available (ASR may be required)

- TTS generates synthetic speech with the same lexical content
  - Festival, cluster unit selection

Time Alignment

- Alignment of original and synthetic speech
- Step 1: match word boundaries via forced alignment

- Step 2: DTW to estimate the alignment within words, using MFCCs from the original and synthetic speech
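The within-word alignment step above can be sketched as a standard dynamic time warping over per-frame MFCC distances. This is a minimal illustration, not the poster's exact implementation; the frame matrices and the Euclidean frame distance are assumptions.

```python
import numpy as np

def dtw_path(x, y):
    """Dynamic time warping between two MFCC sequences.

    x: (n, d) MFCC frames of the original utterance
    y: (m, d) MFCC frames of the synthetic utterance
    Returns the optimal frame-to-frame alignment path as (i, j) pairs.
    """
    n, m = len(x), len(y)
    # Euclidean distance between every pair of frames
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack from the end to recover the path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

In practice the poster restricts DTW to within word boundaries from the forced alignment, which keeps each DTW problem small.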

Feature set 1: Relative features

- Compare low-level descriptors (LLDs) of the original speech with those of the aligned synthetic speech
- Subtract the LLDs to generate a difference trace
- Estimate high-level descriptors (HLDs) over the trace
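The relative features can be sketched as follows: warp the synthetic LLD trace onto the original time axis with the alignment path, subtract, and summarize the difference with statistical functionals. The four functionals shown here are illustrative stand-ins; the poster uses 17 functionals from the IS 2011 set.

```python
import numpy as np

def relative_hld(orig_lld, synth_lld, path):
    """Relative HLDs: statistics over the difference between original
    and time-aligned synthetic low-level descriptors.

    orig_lld:  (n, d) LLD trace of the original utterance
    synth_lld: (m, d) LLD trace of the synthetic utterance
    path: list of (i, j) frame pairs from the DTW alignment
    """
    # Warp synthetic LLDs onto the original time axis and subtract
    diff = np.array([orig_lld[i] - synth_lld[j] for i, j in path])
    # A few illustrative functionals (the poster uses 17 from IS 2011)
    return np.concatenate([diff.mean(axis=0),
                           diff.std(axis=0),
                           diff.min(axis=0),
                           diff.max(axis=0)])
```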


Discussion

[Block diagram: transcript → TTS (synthetic speech); original speech and synthetic speech → alignment (word boundaries + DTW) → resynthesis → aligned synthetic speech]

[Figure: vowel triangle of the TTS voice, after Lee et al., 2004]

[Figure: DTW alignment path between target-speech frames and synthetic-speech frames]

[Diagram: LLDs of original and synthetic speech → HLDs → classifier]

Feature set 2: Duration features

- Estimated from the alignment path
- Localized ratio between the numbers of frames
- Convert the speech-rate signal into log scale
- Smooth the curve with a Hamming filter
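The duration-feature steps above can be sketched directly from the alignment path: count how many synthetic frames map to each original frame, take the log of that local rate, and smooth with a Hamming window. The window length is an assumption; the poster does not specify it.

```python
import numpy as np

def speech_rate_trace(path, win=9):
    """Local speech-rate signal from a DTW alignment path.

    For each original frame, count how many synthetic frames map to it
    (a localized frame-count ratio), take the log, and smooth the curve
    with a normalized Hamming window.
    """
    path = np.asarray(path)
    n = path[:, 0].max() + 1
    # Number of synthetic frames aligned to each original frame
    counts = np.bincount(path[:, 0], minlength=n).astype(float)
    rate = np.log(counts)          # log scale; counts >= 1 on a valid path
    ham = np.hamming(win)
    ham /= ham.sum()
    return np.convolve(rate, ham, mode="same")
```

HLDs for the Duration feature set would then be functionals computed over this smoothed trace.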

Experimental Evaluation

- Database: SEMAINE [McKeown et al., 2012]
  - Natural dyadic interaction (user, operator)
  - Sensitive artificial listener (SAL) framework
  - 10,799 one-second segments

- Binary classification problems:
  - Valence (negative versus positive)
  - Arousal (calm versus active)


- Feature extraction: OpenSMILE toolkit
  - IS 2011 set [Schuller et al., 2011]
  - 60 LLDs plus derivatives (e.g., MFCCs)
  - Only 17 high-level functionals

- Feature sets:
  - Baseline HLD: 2,040 features (60 × 2 × 17)
  - Duration HLD: 34 features (1 × 2 × 17)
  - Relative HLD: 2,040 features (60 × 2 × 17)

- Feature selection (two-stage approach):
  - Information gain ratio (2,040 → 500)
  - Correlation feature selection (500 → 100)
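A minimal sketch of a two-stage filter in the spirit of this pipeline, with two labeled substitutions: scikit-learn's mutual information ranking stands in for the information gain ratio, and a greedy pairwise-correlation filter stands in for correlation feature selection (CFS). The thresholds are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def two_stage_select(X, y, k1=500, k2=100, max_corr=0.9):
    """Two-stage filter feature selection.

    Stage 1: rank features by mutual information with the label
             (stand-in for information gain ratio); keep the top k1.
    Stage 2: greedy correlation filter (stand-in for CFS): walk the
             ranked list and keep a feature only if it is not too
             correlated with any already-kept feature, up to k2.
    """
    mi = mutual_info_classif(X, y, random_state=0)
    ranked = np.argsort(mi)[::-1][:k1]
    kept = []
    for f in ranked:
        corr = [abs(np.corrcoef(X[:, f], X[:, g])[0, 1]) for g in kept]
        if not corr or max(corr) < max_corr:
            kept.append(f)
        if len(kept) == k2:
            break
    return kept
```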

- Classifiers: linear-kernel SVM (SMO)
  - Leave-one-subject-out (LOSO) cross-validation
  - Weighted average across speakers
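The evaluation protocol above can be sketched with scikit-learn, using `LinearSVC` as a stand-in for the SMO-trained linear SVM. Pooling correct predictions over all held-out segments is equivalent to a per-speaker average weighted by segment count.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracy(X, y, subjects):
    """Leave-one-subject-out evaluation with a linear SVM.

    Trains on all subjects but one, tests on the held-out subject,
    and pools accuracy over segments (weighting speakers by their
    number of segments).
    """
    logo = LeaveOneGroupOut()
    correct = total = 0
    for train, test in logo.split(X, y, groups=subjects):
        clf = LinearSVC(C=1.0).fit(X[train], y[train])
        correct += (clf.predict(X[test]) == y[test]).sum()
        total += len(test)
    return correct / total
```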

[Bar charts: number of selected features per LLD group (RASTA, F0, VQ, energy, MFCC, spectral) for Baseline HLD versus Relative HLD, for arousal and valence]

Results:

- Performance of Relative HLD is similar to Baseline HLD

- Best performance when all feature sets are considered
- Absolute gain of 2.1% (arousal) and 2.8% (valence)
- The proposed features provide complementary information
- Duration HLDs achieve performance above chance

- Over 26% of the selected features come from the Relative HLD set

Future Directions

- Re-synthesizing speech before estimating features

- Evaluating different speech synthesis approaches
- Building a family of synthetic speech

Acknowledgements: Study funded by NSF (IIS 1329659)