
EmoBGM: estimating sound's emotion for creating slideshows with suitable BGM

Cedric Konan, Hirohiko Suwa, Yutaka Arakawa, Keiichi Yasumoto
Graduate School of Information Science

Nara Institute of Science and Technology (NAIST)
Ikoma, Nara 630-0192, JAPAN

{konan.cedric.js5, h-suwa, ara, yasumoto}@is.naist.jp

Abstract—This paper presents a study on estimating the emotions conveyed in clips of background music (BGM) to be used in an automatic slideshow creation system. The system we aim to develop automatically tags each given piece of background music with the main emotion it conveys, in order to recommend the most suitable music clip to slideshow creators based on the main emotions of the embedded photos. As a first step of our research, we developed a machine learning model to estimate the emotions conveyed in a music clip and achieved 88% classification accuracy with cross-validation. The second part of our work involved developing a web application using the Microsoft Emotion API to determine the emotions in photos, so the system can find the best candidate music for each photo in the slideshow. 16 users rated the recommended background music for a set of photos using a 5-point Likert scale, and we achieved average ratings of 4.1, 3.6 and 3.0 for photo sets 1, 2, and 3 respectively in our evaluation task.

I. INTRODUCTION

Music is ubiquitous in everyday life. It is in the background in movies, supermarkets, elevators, etc. as we go about our everyday tasks. Some studies show that music can provoke both positive and negative behaviors [1], [8] and promote attention and wakefulness when performing tasks. It can also affect how we perceive the emotions observed in an image or video. For instance, people viewing photographs of smiling faces while listening to happy music will rate the emotions observed in the photos as happier than those viewing the photos without background music. This ability of music to enhance the emotional impact of a photo or a scene has led to its prevalence in movies, and more and more people are creating slideshows of their favorite photos set to background music as mementos.

In order to preserve the emotional memories of the events in photos, we need to embed music that conveys a similar emotion to the one depicted in the images. However, how can we accurately choose this suitable background music for the perfect video memory? This is the main question we aim to answer in this paper.

Various researchers have focused on both assigning emotions to pieces of music and extracting emotions from media files. The techniques used have ranged from textual analysis [11], audio patterns and sound frequency analysis [5] to image processing combined with machine learning, and have resulted in high levels of classification accuracy. However, these studies do not address the problem of matching images and background music based on the emotions conveyed. We build upon this work by developing a system that combines both emotion extraction from images and emotion assignment to music, to recommend appropriate music clips for a slideshow of pictures. Our work focuses on two factors: (i) how to assign background music clips despite the absence of lyrics or user annotations and (ii) user evaluation of the system.

We started our work with a tagging task where 250 participants tagged 50 music clips from a set of 14 different emotions. We developed EmoBGM, an emotion estimation model for background music, using audio features extracted with JAudio [7] and the emotions previously tagged. We then conducted a subjective evaluation of the resulting model. 16 users rated, using a 5-point Likert scale, the recommended background music for 3 sets of photos we provided. The background music database embedded in the evaluation system had been analyzed by our model (EmoBGM) and tagged automatically with emotion tags. The users rated the recommendations with averages of 4.1, 3.6 and 3.0 for photo sets 1, 2, and 3 respectively, indicating a good level of appropriateness.

The rest of the paper is organized as follows. Section II presents related work, followed by a description of the challenges of our study in Section III. Section IV then presents the methodology used in this study. Section V presents the evaluation (results and discussion), followed by the conclusion and future work.

II. RELATED WORK

Various approaches have been used to determine the main emotion provoked by a piece of music.


Fig. 1. Proposed method. Top: background music emotion model creation (tagging task (A) with 250 participants, 50 music clips and 14 emotion tags; audio feature extraction (B); EmoBGM (C)). Bottom: final system architecture (BGM and images are tagged with emotions by EmoBGM (C) and the emotion-in-image analyzer (D), then matched by the BGM recommendation system (E)).

Eerola et al. [2] used acoustic features such as intensity and rhythm of a music piece, with Gaussian Mixture Models achieving an accuracy of 85%. Laurier et al. [4], on the other hand, used text information such as lyrics in addition to the acoustic information. Poria et al. [10] proposed a multimodal sentiment analysis which uses web videos, harvested from YouTube, Vimeo and VideoLectures, to demonstrate a model that uses audio, visual and textual modalities as sources of information. They fused the feature information from audio, video and text using feature-level fusion, concatenating the feature vectors of all three modalities to form a single long feature vector. You et al. [12] also used social networks as a data source for their research. They used a convolutional neural network for image sentiment analysis and achieved at least 75% precision. The research mentioned above tackled some of the challenges we faced in our work. However, the main difference from our work lies in the fact that we focus on background music, which does not have lyrics. Therefore, we have to rely only on core audio features such as MFCC-OSD (Mel-Frequency Cepstral Coefficients Overall Standard Deviation), Area Method of Moments of MFCCs OSD, and Area Method of Moments of MFCCs Overall Average. We extracted these features using the feature extractor JAudio [7]. In addition, as we are targeting recommendation of background music for personal photos, we want to accurately guess the sentiment without relying on comments or annotations, to limit the user burden. Some applications currently available in app markets (Google Play or the Apple App Store) pursue a similar aim to ours by creating automatic slideshows from personal photos accompanied by background music. Such applications include Google Photos1, which curates a user's photos into attractive videos accompanied by a generic piece of instrumental music that might go along with any kind of photos. They generally provide the option to change the background music in case of mismatch, but this places an extra burden on a user who wants a finished video quickly. In addition, the use of the same generic song for various videos may dilute the emotional memories the user wishes to keep, and may also bore the user. We enhance the services and options provided by such applications by taking emotion into account.

III. PROBLEM AND CHALLENGES

Due to the extreme heterogeneity of human sentiment, it is particularly difficult to guess what a person feels while listening to a certain piece of music. For example, the person's feelings can be influenced by their cultural background and current state of mind. However, taking the example of movies, we notice that the director always manages to select suitable music to accompany each scene, and therefore it is possible for humans to easily recognize what music suits what types of scenes. We exploit this insight in our research. Some other problems are: which features should we use to build the most accurate model possible?

1https://play.google.com/store/apps/details?id=com.google.android.apps.photos


How do we determine, among the hundreds of music features available, which ones most influence the emotion in the music? We also face the challenge of determining how to match images and music in our final application.

IV. PROPOSED METHOD

In order to achieve automatic background music assignment to a slideshow of images, the following issues need to be addressed: (i) tagging each piece of BGM with an emotion tag and (ii) determining the main emotion conveyed in a scene. In our research, we focus on developing a prediction model for the former and use already available tools to perform the latter. Figure 1 shows the method we used to solve our problem.

A. Background Music’s emotion model creation

The first part of our proposed method deals with creating the background music emotion model. We conducted an experiment where users tagged a set of background music clips with a list of emotions that we provided. Then, we extracted audio features from the music files and used them to create a model via machine learning. Figure 3 presents a detailed summary of the background music model creation process.

1) Tagging task (A): To tag a music piece with an emotion, we first needed to determine the main emotions people feel when listening to various kinds of music scores.

Several studies [3], [6] have classified human emotions into categories. Parrott [9] divided them into 6 primary emotions (love, joy, surprise, anger, sadness and fear). These primary emotions are then split into secondary and tertiary emotions, which results in a list of 100+ emotions. This extensive list provides a comprehensive and easily understandable structure of human emotion. In an initial survey, we found that people could more easily tag music when provided with the comprehensive list than when provided with the basic emotion classes alone. We therefore used Parrott's list for our work.

Presenting users with 100+ emotions as tags can lead to confusion and make the task tedious. We therefore sought to identify the most frequently felt emotions and remove the others from the list. We asked 6 native English speakers in our university to listen to a music database and highlight the emotions they thought best expressed the feelings they felt while listening to each clip. The result of this task gave us the 14 emotion tags listed in Table I that we used in our research.

The sound dataset we used in our research is composed of 50 music files split into 5 sets of 10. They are background music from famous movies of different genres: comedy, action, animation, drama, etc.

TABLE I
LIST OF EMOTION TAGS

Primary emotions    Secondary emotions
Love                Love, Tenderness, Passion
Anger               Anger, Fury
Happiness           Joy, Excitement, Thrill
Sadness             Sadness, Suffering, Despair
Fear                Fear, Anxiety, Horror

We cut the background music with the rule that one piece of background music must cover only one scene, contain no lyrics, and last less than 40 seconds.

We used Amazon Mechanical Turk to recruit 250 workers (turkers) to help us tag the background music dataset. Via a web application that we developed (Figure 2), we asked the turkers to listen to the background music in their set and to select the appropriate emotions from the 14 emotions listed. Each music clip could be tagged with a maximum of 3 emotions. The music looped until they chose at least one tag and validated their choice by clicking the "Next Song" button. We targeted 50 tags per song (i.e. 50 workers).

Fig. 2. Tagging task application (screenshot: music player and the list of emotion tags).

2) Audio features extraction (B): After the tagging task, we extracted the acoustic features (MFCC, LPC, beat sum, etc.) of each music file using JAudio.

We used only acoustic features for two main reasons: first, to allow users to include instrumental music pieces, including their own compositions, which lack textual information; and second, because prior research has shown that acoustic information alone provides high accuracy in recognizing emotion in music pieces [10].
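As a rough illustration of this step, the sketch below computes frame-level MFCCs for one clip and summarizes them as overall averages and standard deviations, loosely analogous to the JAudio aggregates mentioned above. librosa is used here only as a stand-in for JAudio, and the feature set and statistics are simplified assumptions rather than the exact JAudio output.

# Sketch only: librosa stands in for JAudio; the features and summary
# statistics are illustrative, not the exact JAudio feature set.
import numpy as np
import librosa

def extract_features(wav_path):
    """Return a fixed-length feature vector for one BGM clip."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # frame-level MFCCs
    # Summarize frames as overall average and overall standard deviation,
    # roughly analogous to JAudio's "Overall Average" / "Overall Standard
    # Deviation" aggregates used in the paper.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical usage over the 50-clip dataset:
# X = np.vstack([extract_features(p) for p in clip_paths])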

3) EmoBGM (C): We fed the extracted acoustic features into WEKA and built a classification model to automatically tag music files with an emotion. We created models using different sets of features and different algorithms (random forest, J48, support vector machine). Table II summarizes the results we obtained.


Fig. 3. Summary of the music model creation process: the WAV dataset is passed through audio feature extraction to produce an ARFF file with 99 audio features plus the emotion label; WEKA's BestFirst attribute selection algorithm reduces this to an ARFF file with 9 features, which is fed to the random forest algorithm to build the BGM emotion analyzer.

The best classification accuracy was reached when using the random forest algorithm with the 9 best features computed by the BestFirst attribute selection algorithm.

TABLE II
COMPARISON OF CLASSIFICATION RESULTS USING DIFFERENT ALGORITHMS

                         J48    SVM    Random forest
20 random features       86%    80%    86%
9 random features        74%    68%    80%
5 among best features    84%    80%    88%
9 best features          84%    88%    88%
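A minimal sketch of an equivalent training and evaluation step is shown below, assuming the features and labels from the tagging task have already been exported to the hypothetical files named in the code. It uses scikit-learn's RandomForestClassifier with univariate feature selection as a rough stand-in for the WEKA/BestFirst pipeline described above, so its numbers will not match Table II exactly.

# Sketch only: scikit-learn analogue of the WEKA pipeline described above.
# X (n_clips x n_features) holds JAudio-style acoustic features and y the
# majority emotion tag per clip; the file names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X = np.load("bgm_features.npy")
y = np.load("bgm_emotion_tags.npy")

# SelectKBest(k=9) is a univariate stand-in for WEKA's BestFirst search,
# which selects attributes with a different (search-based) strategy.
model = make_pipeline(
    SelectKBest(f_classif, k=9),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV (fold count assumed)
print("Mean accuracy: %.2f" % scores.mean())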

B. Implementation of the evaluation application

The second part of our work dealt with the evaluation of the resulting model. The main question we want to answer is: is our model capable of recommending background music for a set of photos conveying a certain emotion? We performed a subjective evaluation to verify whether users found the recommendation of background music based on emotion useful and whether the results provided suited them.

To do this, we built a web application which embeds a set of background music clips that our model analyzed and classified per category with a confidence percentage. Since we want to recommend background music for a set of images based on emotion, we need an automatic image emotion analyzer to output the main emotion conveyed in a given image. Then, based on the emotion and confidence estimation, our application recommends a suitable piece of background music.

As the automatic image emotion analysis system, we used the Microsoft Emotion API, which we first evaluated to confirm its ability to guess emotions in images. We followed the process explained below.

1) Microsoft Emotion API evaluation (D): The Microsoft Emotion API is a service provided by Microsoft Cognitive Services. From a photo, this service is able to analyze the faces and give an estimation of the emotion of the person in the image. It classifies the facial emotion into 8 categories (anger, contempt, happiness, disgust, neutral, sadness, surprise and fear). Figure 4 shows an example of the API's output from their test platform2.

Fig. 4. Example of result given by Microsoft Emotion API
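For illustration, a request to this service from Python might look like the sketch below. The endpoint URL, the placeholder key and the response layout are assumptions based on the public documentation of the Emotion API at the time, not details taken from the paper.

# Sketch only: querying the Microsoft Emotion API over HTTP.
# Endpoint URL and response layout are assumed; the key is a placeholder.
import requests

ENDPOINT = "https://westus.api.cognitive.microsoft.com/emotion/v1.0/recognize"  # assumed
API_KEY = "<subscription-key>"  # hypothetical placeholder

def strongest_emotion(image_path):
    """Return (emotion, score) for the first face detected in the photo."""
    with open(image_path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={
                "Ocp-Apim-Subscription-Key": API_KEY,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    resp.raise_for_status()
    faces = resp.json()          # list of faces, each with a "scores" dict
    if not faces:
        return "neutral", 0.0    # no face found: fall back to neutral
    scores = faces[0]["scores"]  # e.g. {"happiness": 0.99, "anger": 0.001, ...}
    emotion = max(scores, key=scores.get)
    return emotion, scores[emotion]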

To evaluate the predictions of the Microsoft Emotion API, we built another web application (Figure 5), where the user had to choose the emotion they think the person in the photo is feeling. Among the 10 candidate photos, some are from our lab photo library and some are from the internet.

Fig. 5. Microsoft Emotion API evaluation app (screenshot: the photo to be tagged with an emotion and the list of tags to choose from).

24 people participated in this task. 18 of the 24 were present when some of the photos were taken, so they knew the emotion felt by the people at that time, a fact we used as ground truth. We asked the participants to focus on the facial expressions and not on the whole scene, since the Microsoft Emotion API only uses facial expressions. The results, summarized in Table III, show that the Microsoft Emotion API is very accurate, correctly guessing the major emotion in 8 out of the 10 images provided.

2https://www.microsoft.com/cognitive-services/en-us/emotion-api


TABLE III
MICROSOFT EMOTION API EVALUATION TASK RESULTS

Image      Microsoft Emotion API guess    Human guess
Image 1    Surprise 71%                   Surprise 67%
Image 2    Neutral 63%                    Happiness 50%; Neutral 33%
Image 3    Anger 99%                      Anger 99%
Image 4    Happiness 100%                 Happiness 99%
Image 5    Disgust 55%; Anger 34%         Disgust 41%; Sadness 33%
Image 6    Sadness 100%                   Sadness 67%
Image 7    Happiness 99%                  Happiness 67%
Image 8    Surprise 51%; Fear 47%         Fear 63%
Image 9    Happiness 99%                  Happiness 99%
Image 10   Neutral 99%                    Neutral 45%; Contempt 36%

2) Best background music selection: Our application embeds a set of candidate background music clips together with the emotion estimation given by our model. To get a background music recommendation, the user uploads their photos. The system sends a request to the Microsoft Emotion API and gets the emotion estimation of each photo. Using this estimation, we calculate the matching score of each candidate BGM with the following equation:

Score_BGM(i) = | S_img(i) − P_bgm(i) |    (1)

where S_img(i) is the percentage of the strongest sentiment detected in the image and P_bgm(i) is the percentage of the strongest sentiment of the candidate background music.
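One possible reading of this selection step is sketched below. Restricting candidates to clips whose strongest emotion matches the photo's, and treating the smallest score (closest confidence) as the best match, are assumptions the paper does not spell out; the function and variable names are hypothetical.

# Sketch only: Eq. (1) plus a top-k selection under the assumptions above.
def recommend_bgm(photo_emotion, photo_score, bgm_library, k=3):
    """photo_emotion, photo_score: strongest emotion of the photo and its
    percentage; bgm_library: list of (title, emotion, percentage) tuples
    tagged by EmoBGM. Returns up to k recommended clip titles."""
    candidates = [
        (title, abs(photo_score - pct))          # Score_BGM(i), Eq. (1)
        for title, emotion, pct in bgm_library
        if emotion == photo_emotion              # assumed category filter
    ]
    candidates.sort(key=lambda item: item[1])    # closest confidence first
    return [title for title, _ in candidates[:k]]

# Hypothetical usage:
library = [("clip_a.wav", "happiness", 0.92),
           ("clip_b.wav", "happiness", 0.64),
           ("clip_c.wav", "sadness", 0.88)]
print(recommend_bgm("happiness", 0.99, library))  # ['clip_a.wav', 'clip_b.wav']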

As shown in Figure 6, a screenshot of the final system, the 3 background music clips with the highest scores are selected and recommended via a play button displayed below each photo. For photos tagged as neutral, we recommend a random background music clip downloaded from the Free Music Archive3.

Fig. 6. EmoBGM evaluation app (screenshot: the strongest emotion estimated in each picture and buttons to play the recommended BGMs).

3http://www.freemusicarchive.org/

Fig. 7. Average rating (on a 5-point Likert scale) of how appropriate the recommended BGM is for each set of pictures (SET 1, SET 2, SET 3).

C. Evaluation scenario

To check whether our system succeeded in matching the background music with the emotions depicted in pictures, we tested it on 3 different sets of photos, each containing 5 photos. The photos in the first set were of a wedding scene showing the bridal couple smiling directly into the camera. The second set was of people on a sightseeing trip in Kanazawa, Japan. Some photos showed the people interacting with the surroundings and each other, while others showed them smiling into the camera at a restaurant. The final set showed people talking and playing in a soccer game. None of the people pictured were looking into the camera. 16 people participated in the evaluation. Each participant completed 3 tasks. First, they answered the following question: Do you think that a system that analyses the emotion in the photos and suggests the music accordingly is necessary? Next, they rated the recommended music for each set of photos using a 5-point Likert scale and commented on the association between the music and the photo set (i.e. the reason behind the rating given). Finally, they answered the question: What information or features are important to provide better matching results?

D. Evaluation results

For the first question, 100% of the participants agreed that a system for recommending background music according to the emotion is necessary and would save a lot of time when creating a slideshow. Figure 7 shows the average user rating of how appropriate the recommended music clips are, based on the emotions observed in the photos (task 2 in the evaluation). Out of a maximum of 5, users gave ratings of 4.1, 3.6 and 3.0 for photo sets 1, 2 and 3 respectively.


The key phrases from the comments given as reasons for the ratings are summarized below. We discuss them in more detail in the discussion section.

E. Discussion

We observed a decrease in rating from 4.1 for the first set (wedding ceremony) to 3.0 for the third set (soccer game). Indeed, the users were very excited about the first set and appreciated the accuracy of the recommended music, which, according to them, matches the emotion conveyed in the images.

Their appreciation then dropped because of the difference in emotional impact of each set. Photo set 1, depicting a wedding ceremony and rated as happy by the image emotion analysis system, caused a peak of excitement among the users; hence the rating was relatively high for the first set. The second photo set, showing a lab trip, aroused less excitement among users, especially as the pictures do not necessarily show the faces of the people. The rating of the third set is relatively low. This is due to the fact that few of our users play soccer or team sports. It was therefore difficult for them to understand the emotion of the people in the photos and to decide whether a piece of background music suitably expresses what the person in the photo was feeling.

It emerges from this analysis that, besides the major emotion conveyed in a photo, recognizing the context of the scene is also very important for accurately matching the images and the background music.

From the results and comments we obtained, we are able to confirm that our method succeeded in accurately estimating the emotion in a piece of background music. However, to increase accuracy and provide better recommendations for slideshows, we must extend the classification to subclasses, such as parties or sporting events within the happiness category.

V. CONCLUSION

Determining the emotion a media file evokes in someone is a very difficult task. Nevertheless, the human senses somehow succeed in distinguishing the different emotions conveyed in photos or music. We exploited this insight to create a machine learning model capable of determining which emotion is conveyed by a piece of background music. Despite the lack of additional information such as lyrics or annotations, which could give our emotion recognition model a clue, we reached 88% accuracy with cross-validation, and average ratings of 4.1, 3.6 and 3.0 for photo sets 1, 2, and 3 respectively in the subjective evaluation we conducted, demonstrating the effectiveness of our proposed system. For future work, we plan to improve our music model by adding the ability to automatically label the background music with a genre and possible suitable events. This will allow our system to make better recommendations even within categories. Our image emotion analyzer can also be improved by using Google Cloud Vision4 and the photos' EXIF information. Besides detecting faces and classifying them into emotion categories, Google Cloud Vision detects individual objects in a photo and can provide relevant words (labels) that help in understanding the context of the photo; GPS data, date information, etc. can additionally be obtained from the photos' EXIF data. Next, we will define the rules to choose suitable background music for a whole set of related photos.

ACKNOWLEDGMENT

This work is partly supported by JSPS KAKENHI Grant Number 26700007.

REFERENCES

[1] Charles S. Areni and David Kim. The influence of background music on shopping behavior: classical versus top-forty music in a wine store. NA - Advances in Consumer Research, volume 20, 1993.

[2] Tuomas Eerola, Olivier Lartillot, and Petri Toiviainen. Prediction of multidimensional emotional ratings in music from audio using multivariate regression models. In ISMIR, pages 621–626, 2009.

[3] Paul Ekman, Wallace V. Friesen, Maureen O'Sullivan, Anthony Chan, Irene Diacoyanni-Tarlatzis, Karl Heider, Rainer Krause, William Ayhan LeCompte, Tom Pitcairn, Pio E. Ricci-Bitti, et al. Universals and cultural differences in the judgments of facial expressions of emotion. Journal of Personality and Social Psychology, 53(4):712, 1987.

[4] Cyril Laurier, Jens Grivolla, and Perfecto Herrera. Multimodal music mood classification using audio and lyrics. In Machine Learning and Applications, 2008. ICMLA'08. Seventh International Conference on, pages 688–693. IEEE, 2008.

[5] Tao Li and Mitsunori Ogihara. Detecting emotion in music. In ISMIR, volume 3, pages 239–240, 2003.

[6] David Matsumoto. More evidence for the universality of a contempt expression. Motivation and Emotion, 16(4):363–368, 1992.

[7] Cory McKay, Ichiro Fujinaga, and Philippe Depalle. jAudio: A feature extraction library. In Proceedings of the International Conference on Music Information Retrieval, pages 600–3, 2005.

[8] Ronald E. Milliman. The influence of background music on the behavior of restaurant patrons. Journal of Consumer Research, 13(2):286–289, 1986.

[9] W. Gerrod Parrott. Emotions in Social Psychology: Essential Readings. Psychology Press, 2001.

[10] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, and Amir Hussain. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing, 174:50–59, 2016.

[11] Dan Yang and Won-Sook Lee. Music emotion identification from lyrics. In Multimedia, 2009. ISM'09. 11th IEEE International Symposium on, pages 624–629. IEEE, 2009.

[12] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Robust image sentiment analysis using progressively trained and domain transferred deep networks. arXiv preprint arXiv:1509.06041, 2015.

4https://cloud.google.com/vision/
