
Page 1

Emotion in Music: Task Overview

Anna Aljanaki¹, Mohammad Soleymani², Yi-Hsuan Yang³

¹Utrecht University, Netherlands  ²University of Geneva, Switzerland  ³Academia Sinica, Taiwan

16-17 October, MediaEval 2014

Page 2

Task definition

Description

- A benchmark for music emotion recognition systems (similar to, but distinct from, MIREX)

- Focusing on audio analysis (optionally, metadata)

Two subtasks

- Dynamic task (required): predict arousal and valence values for a song every 0.5 s.

- Feature design task: design new or rework existing audio features to estimate emotion for the whole 45 s musical excerpt or dynamically.

Page 3

Ground truth

Development set

- Collected for the Emotion in Music brave new task in 2013.
- 744 files.
- 10 annotators per file.

Test set

- Additional data collected in 2014.
- 1000 files.
- 10 annotators per file.

Page 4

Ground truth. Music

- 1744 musical excerpts of 45 seconds (randomly sampled) from the Free Music Archive (freemusicarchive.org).

- Curated music licensed under Creative Commons.
- Manually checked for quality.
- 10 genres: Rock, Pop, Electronic, Hip-Hop, Classical, Soul and RnB, Country, Folk, International, Jazz.

Page 5

Ground truth. Annotations.

Collecting annotations.

- Amazon Mechanical Turk (mturk.com).
- 10 Mechanical Turk workers annotated each song.
- We averaged the 10 annotations and provided to participants:
  - Continuous annotations of valence and arousal (1 label every 1/2 second).
  - Static annotations of valence and arousal for each file (independent from the continuous ones).
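As an illustration of the data layout, here is a minimal sketch of the averaging step, assuming each worker's continuous annotation has already been aligned to the common 2 Hz grid (the variable and function names are ours, not from the task):

```python
import numpy as np

# worker_curves: (10, n_frames) array, one valence (or arousal) value
# per worker per 0.5 s frame, aligned to a common 2 Hz grid
def released_ground_truth(worker_curves: np.ndarray) -> np.ndarray:
    """Average the 10 continuous worker annotations into one curve."""
    return worker_curves.mean(axis=0)
```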

Page 6

Ground truth. Annotations.

Worker Instructions on the Valence-Arousal Space

The workers were given the following instructions to introduce the valence-arousal space to them.

- Valence refers to the degree of positive or negative emotions one experiences from a given piece of music.
  - Positive valence: happiness, joy, excitement.
  - Negative valence: sadness, fear, anxiety, anger.
- Arousal refers to the intensity of the music clip.
  - High arousal: loud, energetic, emotionally engaging.
  - Low arousal: quiet, peaceful, repetitive.

Page 7

Ground truth. Annotations.

Annotation Interface (screenshot not reproduced in this transcript)

Page 8

Ground truth. Annotations.

Some statistics

- 250 out of 424 workers (59%) passed the qualification test.
- It took annotators 10.5 minutes on average to complete the task (3 songs), and we paid $0.40 per task.
- 99% of the time the song was unfamiliar to the annotator.
- In general, the annotators enjoyed the music (on a scale from 1 to 5, mean liking = 3.32 ± 1.22, median = 4).

Page 9

Ground truth. Annotations.

Static annotations.

A measure of inter-annotator agreement, Krippendorff's alpha:

- Valence: 0.22
- Arousal: 0.37
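For reference, a minimal sketch of Krippendorff's alpha for interval data, assuming a complete raters × items matrix with no missing values (the full coefficient also handles missing data, which this sketch does not):

```python
import numpy as np

def krippendorff_alpha_interval(ratings: np.ndarray) -> float:
    """Krippendorff's alpha for interval data, complete data assumed.

    ratings: (n_raters, n_items) matrix of static valence or arousal labels.
    """
    n_raters, n_items = ratings.shape
    # observed disagreement: squared differences between rater pairs within items
    d_obs = 0.0
    for j in range(n_items):
        col = ratings[:, j]
        d_obs += ((col[:, None] - col[None, :]) ** 2).sum() / (n_raters * (n_raters - 1))
    d_obs /= n_items
    # expected disagreement: squared differences between all values, pooled
    vals = ratings.ravel()
    n = vals.size
    d_exp = ((vals[:, None] - vals[None, :]) ** 2).sum() / (n * (n - 1))
    return 1.0 - d_obs / d_exp
```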

Page 10

Ground truth. Annotations.

Dynamic annotations.

A measure of inter-annotator agreement, Kendall's W, after discarding the first 15 seconds:

- Valence: 0.16 ± 0.11
- Arousal: 0.20 ± 0.13
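A minimal sketch of Kendall's W for one song, assuming the dynamic annotations form an annotators × frames matrix at 2 Hz and ignoring the correction for ties (names are ours):

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(annotations: np.ndarray) -> float:
    """Kendall's coefficient of concordance for one song.

    annotations: (n_annotators, n_frames) continuous valence or arousal curves.
    """
    m, n = annotations.shape
    # rank each annotator's curve over time, then sum the ranks per frame
    ranks = np.apply_along_axis(rankdata, 1, annotations)
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# discard the first 15 s (30 frames at 2 Hz) before computing, as in the task:
# w = kendalls_w(annotations[:, 30:])
```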

Page 11

Evaluation

Dynamic subtask evaluation

We use Pearson's correlation coefficient and RMSE as metrics, in the following steps:

1. Calculate Pearson's rho between predictions and ground truth for each song separately.

2. Average across songs separately for valence and for arousal.

3. Rank all submissions for each dimension based on the averaged rho.

4. In case the difference based on the one-sided Wilcoxon test is not significant (p > 0.05), we use RMSE to break the tie.

5. If the ranking changed, we run the significance test between neighbouring pairs again (as in bubble sort).

Feature design subtask evaluation

Same procedure, but Pearson's rho is calculated for all the songs in the test set at once.
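A minimal sketch of the core of this procedure for one pair of submissions, assuming predictions and ground truth are dictionaries mapping song IDs to per-frame arrays (the helper names are ours):

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def per_song_scores(preds: dict, truth: dict):
    """Step 1: Pearson's rho and RMSE per song."""
    rhos, rmses = [], []
    for sid in sorted(truth):
        p, t = np.asarray(preds[sid]), np.asarray(truth[sid])
        rhos.append(pearsonr(p, t)[0])
        rmses.append(np.sqrt(np.mean((p - t) ** 2)))
    return np.array(rhos), np.array(rmses)

def a_beats_b(rhos_a: np.ndarray, rhos_b: np.ndarray, alpha: float = 0.05) -> bool:
    """Steps 3-4: is A's per-song rho significantly higher than B's?

    When this returns False, the caller falls back to RMSE to break the tie.
    """
    _, p = wilcoxon(rhos_a, rhos_b, alternative='greater')
    return p < alpha
```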

Page 12

Baseline

The organizers decided not to submit runs and instead provide a simple baseline that participants should beat.

- Five features: spectral flux, HCDF (harmonic change detection function), loudness, roughness and zero crossing rate.

- Linear regression
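A minimal sketch of such a baseline using librosa and scikit-learn; it covers three of the five features (spectral flux via onset strength, loudness approximated by RMS, and zero crossing rate), since HCDF and roughness have no one-line librosa equivalent:

```python
import numpy as np
import librosa
from sklearn.linear_model import LinearRegression

def frame_features(path: str, hop_s: float = 0.5, sr: int = 22050) -> np.ndarray:
    """Per-0.5 s features: spectral flux, loudness (RMS), zero crossing rate."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(hop_s * sr)
    flux = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)[0]
    n = min(len(flux), len(rms), len(zcr))
    return np.stack([flux[:n], rms[:n], zcr[:n]], axis=1)

# X: frame features stacked over the training songs,
# y: the corresponding per-frame arousal (or valence) labels
# model = LinearRegression().fit(X, y)
```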

Page 13

Results - Arousal

7 teams crossed the finish line; 6 teams beat the baseline (at least for arousal).

Dynamic task

Rank  Team              Arousal ρ     RMSE
1     TUMMISP           0.35 ± 0.45   0.10 ± 0.05
2     SAIL              0.28 ± 0.50   0.13 ± 0.07
3     UoA               0.21 ± 0.57   0.08 ± 0.05
4     Beatsens          0.23 ± 0.56   0.12 ± 0.05
5     Rainbow           0.18 ± 0.60   0.12 ± 0.07
6     THUHCSIL          0.17 ± 0.41   0.12 ± 0.05
7     Baseline          0.18 ± 0.36   0.14 ± 0.06
8     Average baseline  0             0.39 ± 0.03

Page 14

Results - Valence

Dynamic task

The teams highlighted in bold beat the baseline; the other teams are in the same rank as it.

Rank  Team              Valence ρ     RMSE
1     TUMMISP           0.20 ± 0.49   0.08 ± 0.05
2     Beatsens          0.12 ± 0.55   0.09 ± 0.05
3     SAIL              0.15 ± 0.50   0.10 ± 0.06
4     UoA               0.17 ± 0.50   0.14 ± 0.07
5     THUHCSIL          0.10 ± 0.37   0.09 ± 0.05
5     Rainbow           0.07 ± 0.29   0.10 ± 0.06
5     Baseline          0.11 ± 0.34   0.10 ± 0.06
6     Average baseline  0             0.34 ± 0.03

Page 15

Results

Only one team designed new features.

Feature design - static evaluation

       Arousal        Valence
       ρ²    RMSE     ρ²    RMSE
SAIL   0.53  0.32     0.28  0.27

Feature design - dynamic evaluation

       Arousal        Valence
       ρ     RMSE     ρ     RMSE
SAIL   0.22  0.12     0.11  0.09

Page 16

Results

Dynamic runs - Arousal (plot not reproduced in this transcript).

Page 17

Results

Dynamic runs - Valence (plot not reproduced in this transcript).

Page 18

Approaches

Beatsens

- 54 features from MIRToolbox.
- Annotations are modeled as a continuous conditional random field (CCRF) process.
- SVR is used as the base learner.
- Best performance is achieved by a combination of spectral, dynamic and rhythmic features, of which the most important were MFCCs.

Page 19

Approaches

SAIL

They designed three types of new features:

1. Compressibility features
2. Median Spectral Band Energy
3. Spectral Centre of Mass

They use Partial Least Squares Regression in combination with Haar coefficients to predict the dynamic ratings based on features from the whole song.
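A sketch of that general idea, assuming song-level feature vectors X and fixed-length 2 Hz rating curves Y, and using PyWavelets for the Haar transform (the function names are ours, not SAIL's):

```python
import numpy as np
import pywt
from sklearn.cross_decomposition import PLSRegression

def fit_pls_haar(X, Y, n_components: int = 8, level: int = 3):
    """Fit PLS from song-level features to Haar coefficients of rating curves."""
    coeff_lists = [pywt.wavedec(y, 'haar', level=level) for y in Y]
    sizes = [c.size for c in coeff_lists[0]]
    C = np.array([np.concatenate(c) for c in coeff_lists])
    pls = PLSRegression(n_components=n_components).fit(X, C)
    return pls, sizes

def predict_curves(pls, sizes, X_new):
    """Predict Haar coefficients, then invert the transform per song."""
    C_hat = pls.predict(X_new)
    curves = []
    for row in C_hat:
        parts, i = [], 0
        for s in sizes:
            parts.append(row[i:i + s])
            i += s
        curves.append(pywt.waverec(parts, 'haar'))
    return np.array(curves)
```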

Page 20

Acknowledgments