Low-Level Fusion of Audio and Video Features for Multi-modal Emotion Recognition


Page 1: Low-Level Fusion of Audio and Video Features for Multi-modal Emotion Recognition

Low-Level Fusion of Audio and Video Features for Multi-modal Emotion Recognition
Chair for Image Understanding and Knowledge-based Systems
Institute for Informatics
Technische Universität München
Sylvia Pietzsch
[email protected]

Page 2: Overview


Video low-level descriptors
  Model-based image interpretation
  Structural features
  Temporal features

Audio low-level descriptors

Combining video and audio descriptors

Experimental results

Conclusion and outlook

Page 3: Model-based Image Interpretation


The model: contains a parameter vector that represents the model's configuration.

The objective function: calculates a value that indicates how accurately a parameterized model matches an image.

The fitting algorithm: searches for the model parameters that describe the image best, i.e. it minimizes the objective function.
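A minimal sketch of this three-part split in Python. The toy contour, the translation-only model, and the use of SciPy's general-purpose minimizer are illustrative assumptions, not the talk's actual face model or fitting algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Toy "annotated image": the contour points the model should reach.
true_contour = np.array([[10.0, 12.0], [20.0, 14.0], [30.0, 13.0]])

# The model: a base shape configured by a parameter vector
# (here only an x/y translation, for brevity).
base_shape = np.array([[0.0, 0.0], [10.0, 2.0], [20.0, 1.0]])

def model(params):
    return base_shape + params

# The objective function: how accurately the parameterized model
# matches the image (lower is better).
def objective(params):
    return np.sum((model(params) - true_contour) ** 2)

# The fitting algorithm: search for the parameters that minimize
# the objective function.
result = minimize(objective, x0=np.zeros(2))
print(result.x)   # ~[10, 12]
```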

Page 4: Local Objective Functions



Page 5: Ideal Objective Functions


P1: Correctness property: the global minimum corresponds to the best fit.

P2: Uni-modality property: the objective function has no local extrema.

[Figure: example objective functions illustrating ¬P1 vs. P1 and ¬P2 vs. P2]

Ideal objective functions don't exist for real-world images.

Only for annotated images: f_n(I, x) = |c_n − x|, where c_n is the annotated position of the n-th contour point.

Page 6: Learning the Objective Function



The ideal objective function generates training data.
A machine learning technique generates the calculation rules.
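A sketch of this two-step idea under stated assumptions: the ideal objective function from the previous slide labels sampled candidate positions, and a generic regressor (here a decision tree; the slide does not name the learner) learns to reproduce those labels. Using raw coordinates as "image features" is a stand-in; a real system would compute features from pixel neighborhoods.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
c_n = np.array([50.0, 80.0])   # annotated position of one contour point

# Step 1: the ideal objective function labels sampled candidates.
candidates = c_n + rng.uniform(-20, 20, size=(500, 2))
targets = np.linalg.norm(candidates - c_n, axis=1)   # f_n(I, x) = |c_n - x|

# Step 2: a machine learning technique learns the calculation rules.
features = candidates   # stand-in for real image features
learned_objective = DecisionTreeRegressor(max_depth=8).fit(features, targets)

print(learned_objective.predict([[55.0, 80.0]]))   # ~5.0 (distance to c_n)
```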

Page 7: Skin Color Extraction


Location of contour lines and skin-colored parts

Adaptive to image context conditions

[Figure: original image, fixed classifier, adapted classifier]

Correctly detected pixels (three example images):

                     Image 1   Image 2   Image 3
  Fixed classifier    90.4%     74.8%     40.2%
  Adapted classifier  97.5%     87.5%     97.0%
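A hedged sketch of the fixed-versus-adapted contrast behind these numbers. The fixed RGB rule, the Gaussian color model, and the Mahalanobis threshold are assumptions for illustration, not the talk's actual classifiers.

```python
import numpy as np

def fixed_classifier(rgb):
    # A common fixed RGB rule of thumb for skin; it never adapts, so it
    # degrades under unusual lighting (cf. 40.2% on the third image).
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)

def adapted_classifier(rgb, seed_pixels):
    # Fit mean and covariance to skin pixels sampled from *this* image
    # (e.g. inside the fitted face model), then threshold the
    # Mahalanobis distance; 9.0 (~3 sigma) is an assumed threshold.
    mu = seed_pixels.mean(axis=0)
    cov = np.cov(seed_pixels.T) + 1e-3 * np.eye(3)   # ridge for stability
    d = np.einsum('...i,ij,...j->...', rgb - mu, np.linalg.inv(cov), rgb - mu)
    return d < 9.0

# Toy usage with hypothetical pixel data.
img = np.random.randint(0, 256, size=(4, 4, 3)).astype(float)
seeds = np.array([[180, 120, 100], [190, 130, 110], [172, 114, 96]], float)
print(fixed_classifier(img).sum(), adapted_classifier(img, seeds).sum())
```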

Page 8: Structural Features


Deformation parameters describe a distinctive state of the face.
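A sketch of how deformation parameters can configure a point-distribution-style shape model; the random mean shape and modes are placeholders, since the slide does not specify the face model's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, num_modes = 20, 4

mean_shape = rng.normal(size=2 * num_points)          # placeholder mean face
modes = rng.normal(size=(2 * num_points, num_modes))  # placeholder deformation modes

b = np.array([1.5, -0.3, 0.0, 0.7])   # deformation parameters
shape = mean_shape + modes @ b        # the configured face shape

# After fitting the model to a frame, b itself serves as the structural
# feature vector describing the current state of the face.
print(shape.shape, b)
```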

Page 9: Temporal Features


Facial expressions emerge from muscle activity.

Optical flow vectors are calculated at equally distributed feature points connected to the shape model.
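A sketch of per-point optical flow with OpenCV's pyramidal Lucas-Kanade tracker. The two synthetic frames and the grid of points are assumptions; in the talk, the points are tied to the fitted shape model.

```python
import numpy as np
import cv2

# Two synthetic frames: a bright square shifted 3 px to the right.
prev = np.zeros((100, 100), np.uint8)
nxt = np.zeros((100, 100), np.uint8)
cv2.rectangle(prev, (30, 30), (60, 60), 255, -1)
cv2.rectangle(nxt, (33, 30), (63, 60), 255, -1)

# Equally distributed feature points (here a grid; in the talk they
# are connected to the shape model).
pts = np.array([[[x, y]] for x in range(35, 60, 10)
                         for y in range(35, 60, 10)], np.float32)

new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, nxt, pts, None)
flow_vectors = (new_pts - pts).reshape(-1, 2)   # per-point motion features
print(flow_vectors)
```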

Page 10: Audio Low-level Descriptors


Aiming at independence of phonetic content and speaker
Coverage of prosodic, articulatory, and voice quality aspects
20 ms frames, 50% overlap, Hamming window function

Zero crossing rate (ZCR)
Pitch
7 formants
Energy
Spectral development
Harmonics-to-Noise Ratio (HNR)
Durations of voiced sounds by HNR
Durations of silences by bi-state energy

SMA (simple moving average) filtering of LLDs
Addition of 1st and 2nd order LLD regression coefficients
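A sketch of the framing scheme and two of the listed LLDs (energy and ZCR), plus first-order regression (delta) coefficients, in NumPy. The synthetic tone and sample rate are assumptions; pitch, formants, and HNR need more machinery than fits here.

```python
import numpy as np

sr = 16000                                  # assumed sample rate
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t)        # 1 s synthetic tone

frame_len = int(0.020 * sr)                 # 20 ms frames
hop = frame_len // 2                        # 50% overlap
window = np.hamming(frame_len)              # Hamming window function

frames = np.stack([signal[i:i + frame_len] * window
                   for i in range(0, len(signal) - frame_len, hop)])

energy = np.sum(frames ** 2, axis=1)                          # frame energy
zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0,
              axis=1)                                         # zero crossing rate

def delta(lld, width=2):
    # 1st-order regression coefficients over +-width neighbouring frames.
    pad = np.pad(lld, width, mode='edge')
    w = np.arange(-width, width + 1)
    return np.convolve(pad, -w, mode='valid') / np.sum(w ** 2)

llds = np.stack([energy, zcr, delta(energy), delta(zcr)], axis=1)
print(llds.shape)   # (num_frames, 4)
```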

Page 11: Combining Audio and Video LLDs


Time series constructed for LLDs (audio and video separately)

Application of functionals to the combined low-level descriptors:
  Linear moments (mean, standard deviation)
  Quartiles
  Durations

Resulting feature vector: 276 audio features + 1048 video features (1324 total)

Classification with a support vector machine (SVM)
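A sketch of this feature-level fusion: functionals over each modality's LLD time series, concatenated into one vector per clip and classified with an SVM. The random LLD streams, their dimensions, and the omission of the duration functionals are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def functionals(lld_series):
    # lld_series: (num_frames, num_llds) -> one fixed-length vector:
    # mean, std, and quartiles per LLD (durations omitted here).
    return np.concatenate([
        lld_series.mean(axis=0),
        lld_series.std(axis=0),
        np.percentile(lld_series, [25, 50, 75], axis=0).ravel(),
    ])

def fuse(audio_llds, video_llds):
    # Low-level fusion: one combined feature vector per clip.
    return np.concatenate([functionals(audio_llds), functionals(video_llds)])

# Toy dataset: 20 clips with random audio/video LLD streams, 2 classes.
X = np.stack([fuse(rng.normal(size=(50, 6)), rng.normal(size=(25, 10)))
              for _ in range(20)])
y = np.arange(20) % 2

clf = SVC(kernel='rbf').fit(X, y)
print(clf.score(X, y))
```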

Page 12: Experimental Results (1)


Database: Airplane Behavior Corpus
  Guided storyline
  8 subjects (25 to 48 years old)
  11.5 hours of video in total

10-fold stratified cross validation

Feature pre-selection by SVM-SFFS (sequential forward floating search)

              Audio   Video   Audiovisual
Features [#]     92     156           200
Accuracy [%]   73.7    61.1          81.8
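A sketch of this evaluation protocol with scikit-learn for the SVM and stratified 10-fold cross validation; using mlxtend for the SVM-wrapped SFFS is my assumption about tooling (the slide only names SVM-SFFS), and the toy data dimensions are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))     # toy stand-in for the 1324 fused features
y = np.repeat(np.arange(5), 20)    # toy stand-in class labels

# SVM-wrapped sequential forward floating search (SFFS): greedily adds
# features, with backward "floating" steps that may drop features again.
sffs = SFS(SVC(kernel='linear'), k_features=10, forward=True,
           floating=True, scoring='accuracy', cv=3)
sffs = sffs.fit(X, y)
X_sel = X[:, list(sffs.k_feature_idx_)]

# 10-fold stratified cross validation on the pre-selected features.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(SVC(), X_sel, y, cv=cv).mean())
```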

Page 13: Experimental Results (2)



Main confusions: neutral vs. nervous, and cheerful vs. intoxicated

Aggressive behavior recognized best

Page 14: Conclusion and Outlook


The combined feature set is superior to the individual audio or video feature sets.

Future work:
  Investigation of further data sets
  Comparison to late fusion approaches
  Performance of asynchronous feature fusion
  Application of hierarchical functionals

Page 15: Thank you!

