

Intraindividual and Interindividual Multimodal Emotion Analyses in Human-Machine-Interaction

Ronald Böck, Stefan Glüge, Andreas Wendemuth

Cognitive Systems
Otto-von-Guericke University Magdeburg
P.O. 4120, 39016 Magdeburg, Germany

Email: [email protected]

Kerstin Limbrecht, Steffen Walter, David Hrabal, Harald C. Traue

Medical Psychology
Ulm University

Frauensteige 6, 89075 Ulm, Germany
Email: [email protected]

Abstract—Interactive computer systems today interact nearly effortlessly with humans via menu-driven mouse- and text-based input. For other modalities, such as audio and gesture control, systems still lack flexibility. To respond appropriately, these intelligent systems require specific cues about the user's internal state. Reliable emotion recognition by technical systems is therefore an important issue in computer science and its applications. In order to develop an appropriate methodology for emotion analyses, a multimodal study is introduced here. Audio and video signals as well as biopsychological signals of the user are applied to detect intraindividual behavioural prototypes that can be used for predictions of the user's emotional states. Additionally, interindividual differences are considered and discussed. Statistical analyses showed results in most cases with statistical significance at a probability value p < 0.05 and an effect size d > 1.05.

Index Terms—Multimodality, Biopsychological Analysis, Audio Analysis, FACS, Individual Calibration

I. INTRODUCTION

Nowadays, Human-Machine-Interaction (HMI) has found its way into people's lives. Especially through mobile devices like smartphones and tablet PCs, an increase in the usage of HMI can be seen. These systems support us with helpful information and mediate intentions in a social and technical environment. On the other hand, state-of-the-art technical systems are currently not capable of recognising social signals the way human beings do, especially in interactions with several users. However, a system which is familiar with a certain user might improve the HMI [1]. Therefore, personalisation is desirable, i.e., the system adapts itself to the user by regarding his behaviour, emotions, and intentions. Specifically, this leads to technologies with companion-like characteristics [2] that can interact with a certain user more efficiently, independent of the situation and environment. Moreover, the system becomes a kind of assistant and partner. Right now, personalisation is provided by the logging of events and by adaptations which are done by the user himself, e.g. in dictation systems. Hence, questions arise: How can we build systems which are recognised as companion-like assistants? And how do they adapt themselves to the user in detail, e.g. regarding the user's behaviour, emotions, and intentions?

In this paper we will focus on one aspect of personalisation: the recognition of the user's emotion. Emotions are highly complex phenomena that include biopsychological reactions, emotional experiences, expressions, behavioural tendencies, and cognitive evaluations [3], [4], [5]. Due to this complexity, a multimodal approach to analysis is necessary and, hence, it is important to consider a variety of channels such as prosody, biopsychological data, gestures, and facial expressions [6], [7], [8], [9]. So far, acted material with requested affective behaviour or moderately natural datasets of high quality have been used to train systems on either unimodal or multimodal recordings [10].

In this study, we use a dataset with naïve recordings, and the modalities of choice (cf. Sect. III-B - III-C) include emotion recognition from speech, biopsychological properties, and manually annotated analysis of facial expressions according to the Facial Action Coding System (FACS). For this, we investigate a Wizard-of-Oz (WOZ) experiment called EmoRec-Woz I [11] designed to induce emotions (cf. [12], [13]) according to the pleasure-arousal-dominance (PAD) space [14]. The corpus simulates typical situations found in HMI, inducing emotions by delays, misunderstandings, ignoring of commands, time pressure, etc., but also by positive feedback. Thus, the user is pushed into each of the eight octants of the PAD space. As the users were not informed about the WOZ character of the experiment and were not actors, the material can be seen as naïve. Hence, it provides natural HMI and, furthermore, material representing different situations. It is assumed that, especially in time-critical and stressful situations, a user usually communicates in short utterances. This way of speaking can be seen, for instance, in traffic, medical (in case of emergency), police, and military scenarios as well as in call-centre dialogues [15]. For this, an analysis of elliptical sentences is helpful to provide additional information to a system. Usually, common datasets do not provide such command-style utterances. Hence, we decided for EmoRec-Woz I because it provides not only the aforementioned utterances but also biopsychological and video modalities.

In this study, we assess the statistical significance of the chosen modalities in emotion recognition, especially comparing intraindividual against interindividual classification of emotions. These evaluations are necessary to approach personalisation and to find a starting point for calibration.



In particular, calibration is difficult due to the unknown baseline of emotional behaviour for each user, that is, obtaining a value for each feature which represents a neutral emotion. In the case of audio classification we found that intraindividual and interindividual systems provided good results. Biopsychological classifiers, in particular if they are calibrated, generally showed recognition accuracies above 75%. Comparable results were obtained by video analysis. This leads to a personalisation process which is based on interindividual classification first, using clearly detected material to adapt the system, and finally training classifiers which are adjusted to a certain user.

II. EXPERIMENTAL SETUP

A. Experimental Design

To simulate a natural interaction, the HMI was implemented as a WOZ experiment hosted by the Section of Medical Psychology, Ulm University, Germany. This experiment, called EmoRec-Woz I [11], represents an automatic voice-controlled memory trainer based on the popular memory game “Concentration”.

The “wizard” has full control over the behaviour of the system and thus, in this case, over the emotion induction. Moreover, due to soft time restrictions and time-critical countdowns, which are used in two Experimental Sequences (ES) (cf. Table I), a stressful situation for the user is generated. In critical situations as well as in daily life, tasks have to be solved under stress, and it is assumed that such an atmosphere directly influences emotion and behaviour [3].

The corpus consists of 20 subjects, ten women and ten men, who are native Germans and not actors. Hence, the material is naïve (i.e., natural, non-acted) and approximates a natural HMI.

The experiment consisted of six different ESs. The characteristics of the ESs are given in Table I. For detailed information on the corpus and the ESs see [11].

B. Experimental Equipment

In this section, we briefly introduce the utilised equipment as well as the corresponding coding system.

Audio Sensors: In the experimental setup two types of microphones were used. First, the internal sound system of a Canon MD215 camcorder was utilised to record 15 participants; in the second phase of the experiment, a Sennheiser ME66 shotgun microphone (sampling rate 44.1 kHz) was used. In both cases the subject's as well as the wizard's statements and commands were recorded.

Biopsychological Sensors: To derive the biopsychological signals of the subjects during the WOZ experiment, the Nexus-32 amplifier was applied, recorded with the Biobserve software. The following parameters are included in the classification:
∙ Skin Conductance Level (SCL) (according to [16])
∙ Respiration (RS)
∙ Blood Volume Pulse (BVP) including heart rate and strength of the heart beat (according to [16])
∙ Electromyogram (EMG)

TABLE II
SELF-RATINGS OF THE USERS DURING THE ESs IN PAD SPACE ACCORDING TO SELF-ASSESSMENT MANIKINS.

User-code   P ES-2   A ES-2   D ES-2   P ES-5   A ES-5   D ES-5
112         7        3        7        4        7        4
114         8        3        8        3        7        3
118         7        5        7        4        7        6
125         7        7        8        1        1        9
127         8        3        8        2        7        3
129         8        4        7        4        7        4
208         7        3        7        1        9        1
212         9        9        4        3        6        4
215         9        2        9        7        3        9
219         9        1        9        2        7        2
225         8        5        9        8        5        9
226         7        6        7        2        7        4
308         7        4        7        3        6        4
423         7        3        7        7        1        5
427         8        5        2        3        3        2
506         7        7        6        3        7        7
510         7        4        8        5        6        5
511         7        3        7        3        6        2
518         9        5        7        1        7        2
602         7        3        7        2        8        2
mean        7.20     4.25     7.05     3.40     5.85     4.35

Facial Action Coding System: Facial expression analysis is based on the FACS [17], which is a system to assess human facial expressions. Based on the anatomy and muscles of the face, more than 40 Action Units (AUs) can be defined to describe facial expressions in an objective way. Furthermore, FACS can be used to systematically categorise emotions. For the seven basic emotions (happiness, sadness, surprise, anger, fear, disgust, and contempt (cf. [18])), specific AU combinations are defined in the EmFACS manual [19].

III. METHODS AND CLASSIFIERS

For the automatic emotion classification analysis we chose two octants of the PAD space, namely “positive pleasure, low arousal, high dominance” (called session ES-2) vs. “negative pleasure, high arousal, low dominance” (called session ES-5). Hence, we realised the experiment as a two-class decision determining whether the emotion belongs to ES-2 or ES-5. We decided to use these sessions because they provided the most expressive and meaningful parts of the experiment. Furthermore, by design of the WOZ scenario, it is ensured that the user is in the emotion we intended to classify and not in an unspecific mixture of emotional states. During the experiment the users were asked to give a self-rating for validation of the emotion induction [20] (cf. Table II). The rating was on a scale from 1 (low) to 9 (high).

A. Audio Classification

In the audio classification we used the Hidden Markov Model Toolkit (HTK) of Cambridge University [21]. HTK applies Hidden Markov Models (HMMs) as a statistical machine learning method and the corresponding learning algorithms to generate a classification result from an input sequence. This technique is used in speech recognition as well as in emotion recognition from speech [22].
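The paper's experiments were run with HTK; the following is only a minimal sketch of such a two-class HMM decision (ES-2 vs. ES-5, cf. above), written with the Python package hmmlearn as a stand-in for HTK. The model topology, the feature dimensionality, and the helper names are illustrative assumptions, not the exact configuration used by the authors.

    # Minimal two-class HMM classification sketch; hmmlearn is used here as a
    # stand-in for HTK. Topology (3 states, diagonal covariances) is assumed.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_class_hmm(sequences, n_states=3):
        """Train one HMM on a list of (n_frames, n_features) feature arrays."""
        X = np.vstack(sequences)                 # concatenate all utterances
        lengths = [len(s) for s in sequences]    # per-utterance frame counts
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        return model

    def classify(utterance, models):
        """Assign the class whose HMM yields the highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(utterance))

    # Hypothetical usage with pre-extracted feature sequences per class:
    # models = {"ES-2": train_class_hmm(es2_seqs), "ES-5": train_class_hmm(es5_seqs)}
    # prediction = classify(test_sequence, models)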



TABLE I
CHARACTERISTICS OF THE EXPERIMENTAL SEQUENCES. FOR SOME SEQUENCES THE AROUSAL WAS INCREASED (*) BY INTRODUCING A COUNTDOWN.

ES    Duration (approx.)  Expected state  Matrix size  Difficulty  Feedback
ES-1  3 min *             PAD + − +       4x4          low         slightly positive feedback (e.g. “you have successfully solved the first task.”)
ES-2  3 min               PAD + − +       4x4          low         positive feedback (e.g. “great!”)
ES-3  5 min               PAD + − +       4x5          low         neutral feedback (e.g. “your performance is average.”)
ES-4  3 min *             PAD − − −       4x5          high        slightly negative feedback (e.g. “your performance is declining.”)
ES-5  4 min               PAD − + −       5x6          very high   negative feedback (e.g. “you are doing very poorly.”)
ES-6  5 min               PAD + − +       5x6          low         positive feedback (e.g. “your performance is improving.”)

TABLE III
OVERVIEW OF THE FEATURES USED IN THE EXPERIMENT.

Audio      Biopsychological                  FACS
12 MFCCs   mean value of EMG                 AUs per minute
Energy     mean heart rate
Δ          mean amplitudes of respiration
ΔΔ         mean difference of r-/s-peak
           mean of first derivative of SCL

The features which are necessary to train and test the models are extracted from speech, using short utterances of the users. Mainly, these are command-like statements, i.e., the commands to play the memory game, e.g. “A3”, “C4”, “C2”, etc. From these sequences, features which are suitable for the toolkit and, furthermore, commonly accepted in the emotion/speech recognition community are extracted as follows: we used 12 standard Mel-Frequency Cepstral Coefficients (MFCCs), the energy term, and, in addition, the first (Δ) and second (ΔΔ) derivatives of each coefficient, i.e., 39 features (cf. Table III). The parameters of the training process were chosen according to [23].
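As an illustration of this 39-dimensional feature vector (12 MFCCs, energy, Δ, ΔΔ; cf. Table III), the sketch below uses the librosa library rather than HTK's front end; the sampling rate, frame and hop sizes, and the log-RMS energy term are assumptions and do not reproduce the exact configuration of [23].

    # Sketch of the 39-dimensional feature extraction (12 MFCCs + energy + Δ + ΔΔ),
    # using librosa instead of HTK; window and hop sizes are assumed values.
    import numpy as np
    import librosa

    def extract_features(wav_path, sr=44100, n_fft=1024, hop=441):
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop)[1:13]  # 12 MFCCs (C1-C12)
        energy = np.log(librosa.feature.rms(y=y, frame_length=n_fft,
                                            hop_length=hop) + 1e-10)    # log-energy proxy
        static = np.vstack([mfcc, energy])                              # 13 static coefficients
        delta = librosa.feature.delta(static)                           # first derivatives (Δ)
        delta2 = librosa.feature.delta(static, order=2)                 # second derivatives (ΔΔ)
        return np.vstack([static, delta, delta2]).T                     # shape: (n_frames, 39)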

In testing, we conducted intraindividual experiments, i.e., material uttered by only one speaker is used for training and testing, as well as interindividual validation. Interindividual validation, commonly known as the Leave-One-Speaker-Out technique, means that one speaker's material is left out during training and afterwards utilised for testing only. The intraindividual setup is as follows: the material is split randomly into two sets, a training set (90% of the material) and a test set (the remaining 10%). This experiment is repeated ten times and, hence, the results (cf. Sect. IV-A) are given as average values.
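Both evaluation protocols can be sketched as follows, assuming per-utterance feature arrays X, class labels y, and speaker IDs groups; scikit-learn's LeaveOneGroupOut is used here merely to illustrate the Leave-One-Speaker-Out scheme, and the intraindividual protocol is the repeated random 90/10 split described above.

    # Sketch of the two evaluation protocols; X, y, groups are assumed to hold
    # per-utterance features, ES labels, and speaker IDs, respectively.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut, train_test_split

    def interindividual_splits(X, y, groups):
        """Leave-One-Speaker-Out: each speaker is held out once for testing."""
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
            yield train_idx, test_idx

    def intraindividual_splits(X_speaker, y_speaker, repeats=10, seed=0):
        """Repeated random 90%/10% split on a single speaker's material."""
        rng = np.random.RandomState(seed)
        for _ in range(repeats):
            yield train_test_split(X_speaker, y_speaker, test_size=0.1,
                                   random_state=rng.randint(1 << 30))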

B. Biopsychological Classification

We segmented ES-2 and ES-5 into small chunks of four seconds with an overlap of 250 ms. On each of these fragments (about 150-200 fragments in ES-2 and 200-300 fragments in ES-5), preprocessing was executed on each biopsychological channel as described in Sect. II-B, returning five values (cf. Table III). Artificial neural networks are used as classifiers. Since the data recorded by the four channels are very different, they require different preprocessing algorithms: The first input value for our neural network was the corrugator EMG channel. Here, we used the mean of the output of a moving-average filter with a window size of 20 ms to acquire the tension of the muscle. To obtain the heart rates from the continuous BVP signal, we identified each r-peak (the highest point within one heartbeat in the BVP signal). Additionally, the mean heart rate of three heartbeats within the measured four seconds and the mean of the amplitudes of the respiration channel were computed. Further, the mean value of the difference between the r-peak and the s-peak (the lowest point within one heartbeat in the BVP signal) of the three heartbeats is applied in the network. Finally, the mean value of the first derivative of the SCL channel data acts as input. Automatic emotion recognition systems should exhibit a certain robustness, for instance in environments like combat missions, where such signals are highly vulnerable to motion artefacts. Therefore, we checked exploratively which are the five best features for biopsychological signals in such cases.
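A minimal sketch of this per-window feature computation is given below using numpy/scipy; the sampling rate, the peak-detection parameters, and the approximation of the respiration amplitude by the mean absolute deviation are assumptions and do not reproduce the exact Nexus-32/Biobserve preprocessing.

    # Sketch of the five biopsychological features per 4-second window (cf. Table III).
    # Sampling rate and peak-detection parameters are assumed values.
    import numpy as np
    from scipy.signal import find_peaks

    def window_features(emg, bvp, resp, scl, fs=256):
        """emg, bvp, resp, scl: 1-D arrays covering one 4-s window sampled at fs Hz."""
        # 1) EMG: mean of a 20 ms moving average as muscle tension.
        win = max(1, int(0.02 * fs))
        emg_mean = np.convolve(np.abs(emg), np.ones(win) / win, mode="same").mean()

        # 2) and 4) BVP: r-peaks (maxima) and s-peaks (minima) of the heartbeats.
        r_peaks, _ = find_peaks(bvp, distance=int(0.4 * fs))   # at least 0.4 s between beats
        s_peaks, _ = find_peaks(-bvp, distance=int(0.4 * fs))
        rr = np.diff(r_peaks[:4]) / fs                          # intervals of ~3 heartbeats
        mean_hr = 60.0 / rr.mean() if len(rr) else np.nan       # mean heart rate in bpm
        n = min(len(r_peaks), len(s_peaks), 3)
        mean_rs = np.mean(bvp[r_peaks[:n]] - bvp[s_peaks[:n]]) if n else np.nan

        # 3) Respiration: mean amplitude (approximated by mean absolute deviation).
        mean_resp = np.abs(resp - resp.mean()).mean()

        # 5) SCL: mean of the first derivative.
        mean_dscl = np.diff(scl).mean() * fs

        return np.array([emg_mean, mean_hr, mean_resp, mean_rs, mean_dscl])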

Paired t-tests and effect sizes (cf. [24]) were calculated for the comparison of intraindividual and interindividual classifiers.

C. Facial Action Coding System

The participants were videotaped during the experiment, and a certified FACS coder annotated all changes in facial musculature according to the FACS manual [17]. Hence, the numbers of all AUs shown during ES-2 and ES-5 were gathered and counted for every participant. Because the experimental sessions differed in duration, a weighted score of AUs per minute was calculated separately for each AU (cf. Table III). ES-2 and ES-5 were taken for further analyses because it is assumed that differences in facial expression between them would be particularly meaningful. For comparison, statistical analyses with U-tests were conducted.
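A sketch of the AU-per-minute weighting and the U-test comparison is given below, assuming a dictionary of raw AU counts and the session duration per participant; scipy's mannwhitneyu is used as the U-test implementation.

    # Sketch of the AU-per-minute weighting and the ES-2 vs. ES-5 U-test.
    # The data layout (AU counts per participant, session durations) is assumed.
    from scipy.stats import mannwhitneyu

    def au_per_minute(au_counts, duration_s):
        """Weight raw AU counts by the session length in minutes."""
        return {au: n / (duration_s / 60.0) for au, n in au_counts.items()}

    def compare_au(rates_es2, rates_es5, au):
        """Two-tailed U-test on one AU's per-minute rates across participants."""
        x = [r.get(au, 0.0) for r in rates_es2]   # one value per participant, ES-2
        y = [r.get(au, 0.0) for r in rates_es5]   # one value per participant, ES-5
        return mannwhitneyu(x, y, alternative="two-sided")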

IV. RESULTS

To retrieve the classification results for each modality, the recordings of the 20 subjects are investigated.

For the statistical analysis the probability value p, which is also called the significance level, is used; its threshold is chosen by the experimenter by convention. Hence, p is a reference for statistical significance and has to be below 0.05, which is a common value in psychology.



Further, the critical t-value of the t-test is the threshold against which the empirical value of the test statistic is compared; for a significant result, the empirical value has to exceed the critical t-value [25]. The effect size d (Eq. 1) is a measure of the strength of the relationship between two variables, where μ1 and μ2 are the means of the two samples and σ is the standard deviation. According to Cohen [24], an effect size d < 0.5 is interpreted as low, 0.5 < d < 0.8 as average, and d > 0.8 as high.

    d = |μ1 − μ2| / σ                                    (1)

The p and d values are used to estimate the significance of the observed differences.
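The comparison of intraindividual and interindividual accuracies can be reproduced in a few lines; the paper does not state which standard deviation enters Eq. 1, so the sketch below assumes the pooled standard deviation of both samples, a convention that reproduces the reported values (e.g. d ≈ 1.05 for the ES-5 audio comparison with μ = 74.90, σ = 12.97 vs. μ = 50.23, σ = 30.66; cf. Sect. IV-A).

    # Sketch of the paired t-test and the effect size of Eq. 1.
    # The pooled standard deviation of both samples is an assumption.
    import numpy as np
    from scipy.stats import ttest_rel

    def effect_size_d(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        sigma = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)   # pooled std (assumed)
        return abs(a.mean() - b.mean()) / sigma

    def compare_classifiers(intra_acc, inter_acc):
        """Per-subject accuracies for one ES (e.g. the rows of Table V)."""
        t, p = ttest_rel(intra_acc, inter_acc)                   # paired t-test
        return t, p, effect_size_d(intra_acc, inter_acc)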

Before we present the individual results of the classifiers, we will discuss some general findings. In particular, audio and biopsychological classification were effective in intraindividual analyses. Once a system was calibrated, it reached high accuracy values. The system trained on biopsychological features reached better accuracies on average, but had higher standard deviations than the audio classifier. The audio system had its advantage in interindividual classification, with an accuracy of at least 50.2% in the worst case compared to 11.3% and 16.2% for the biopsychological classifier, respectively. Based on these results, we computed the p and d values (cf. Sect. IV-A and IV-B). Due to the method itself, FACS makes predictions only for intraindividual analysis. Hence, a meaningful comparison to the other classifiers is only possible in this sense. With FACS, higher values for p, 0.047 < p < 0.082, were obtained than in the audio or biopsychological analyses. Nevertheless, based on the number of occurrences of each AU, we found AUs which are significant for the classification of emotions (cf. Sect. IV-C), especially when comparing different user groups (young vs. old users). Finally, the obtained results were used to derive a global classification.

A. Audio Results

In Table IV (confusion matrix), we present the overall results of a system that classified utterances in an interindividual way. The detailed values for each speaker who was left out are given in Table V. Looking at the classification results, there is a large variation in the values between the speakers, not only in the intraindividual experiments but also between the two experimental sequences. The recognition rates range from 3.8% to 100.0%. Especially in ES-5 these variations are pronounced. The variation between intraindividual and interindividual results is explained by the nature of the experiment. Moreover, manual expert analyses showed that the difference is also due to the varying baselines of the users and, further, to their expressivity in emotions. On the other hand, interindividual results are more homogeneous when considering ES-2 and ES-5 individually. Nonetheless, the absolute values for intraindividual recognition rates are better (cf. Table V). This is due to the adaptation of the system to a user, also called speaker-dependent classification [26]. Unfortunately, a huge amount of data is necessary to obtain a calibrated classifier.

TABLE IV
CONFUSION MATRIX (NUMBER OF SAMPLES ASSIGNED TO ES) AND RECOGNITION RATES IN PER CENT FOR INTERINDIVIDUAL EXPERIMENTS WITH HMMS.

        ES-2   ES-5   %
ES-2    345    243    58.7
ES-5    446    409    47.8

Hence, according to the quality of the material and the number of samples derived for each subject, a fluctuation in the findings was obvious. As explained in Sect. III-A, intraindividual experiments were repeated ten times with randomly chosen training and test samples. Thus, the values given in Table V are averages of the accuracies. From these results, it is reasonable to start with an interindividual classifier and collect material which can be used to adapt the system towards an intraindividual one.

To verify this proposal, we also considered the experimental sequences individually. Analysing the classification results for ES-2 gave an intraindividual mean value of μ = 64.35 with a standard deviation of σ = 19.94. Comparing these values with the interindividual ones (μ = 60.0, σ = 27.3), the probability value p was not significant and, according to [24], the effect strength is low (d = 0.18). In contrast, in ES-5 the results were significant since p < 0.01 and d = 1.05, which is a high effect strength; in detail, we obtained μ = 74.90 and σ = 12.97 for the intraindividual classifiers as well as μ = 50.23 and σ = 30.66 for the interindividual classifiers. Due to the higher expressivity in ES-5, which represents negative emotions and which is confirmed by the biopsychological and FACS analyses as well as by manual expert analysis, we obtained better results in automatic classification.

B. Biopsychological Results

The classification results applying biopsychological features are presented in Table VI.

For ES-2 and ES-5 we obtained highly significant p values, in both cases p < 0.001. In ES-2 the effect strength d = 3.76 is high; in detail, intraindividual classification has μ = 80.61 and σ = 21.64 vs. interindividual classification with μ = 11.34 and σ = 14.46. Moreover, for ES-5 the effect strength d = 4.94 is even higher due to the better performance of intraindividual classification (μ = 86.4, σ = 16.33) against the interindividual one (μ = 16.26, σ = 11.33). This gives an indication that biopsychological features are better at classifying situations like ES-5. The fact that intraindividual classification algorithms have a better accuracy is not really a new discovery. What we showed, however, is that the comparison of intraindividual vs. interindividual classification rates is highly relevant for biopsychological parameters. In comparison to audio features, such a difference is not as distinct (cf. Table V); hence, for the audio modality, intraindividual as well as interindividual classification can be applied. In contrast, the peripheral physiological and central nervous system is highly complex, an aspect which is reinforced by the interindividual observations. This illustrates that only intraindividual classification methods enable accurate and robust emotion recognition for biopsychological systems.



TABLE V
RECOGNITION RATES FOR INTRAINDIVIDUAL AND INTERINDIVIDUAL CLASSIFICATION OF ES-2 VS. ES-5 IN PER CENT WITH HTK.

Subject                   1      2      3      4      5      6      7      8      9      10
ES-2 (intraindividual)    50.0   50.0   38.5   50.0   50.0   65.0   68.4   71.4   100.0  81.3
ES-5 (intraindividual)    66.7   75.0   75.0   70.8   84.6   70.0   84.0   100.0  100.0  63.2
ES-2 (interindividual)    84.0   66.7   88.5   74.2   82.1   86.2   73.1   72.4   90.5   50.0
ES-5 (interindividual)    33.3   36.5   3.8    34.0   22.6   24.3   26.4   86.7   13.3   63.2

Subject                   11     12     13     14     15     16     17     18     19     20
ES-2 (intraindividual)    100.0  100.0  80.0   43.9   68.8   58.5   40.0   40.6   70.0   61.1
ES-5 (intraindividual)    93.3   72.5   55.8   59.1   68.8   83.8   65.0   81.6   68.0   59.1
ES-2 (interindividual)    21.4   28.1   40.6   14.3   55.2   31.2   97.0   79.3   10.7   54.5
ES-5 (interindividual)    83.8   85.0   75.0   97.7   66.7   71.4   9.3    16.7   73.9   80.0

TABLE VI
RECOGNITION RATES FOR INTRAINDIVIDUAL AND INTERINDIVIDUAL CLASSIFICATION OF ES-2 VS. ES-5 IN PER CENT WITH BIOPSYCHOLOGICAL CLASSIFIERS.

Subject                   1      2      3      4      5      6      7      8      9      10
ES-2 (intraindividual)    99.5   86.7   96.8   63.9   81.3   93.9   90.5   59.8   61.4   84.8
ES-5 (intraindividual)    98.8   94.7   98.6   81.1   95.6   98.5   95.6   75.5   77.1   83.7
ES-2 (interindividual)    3.7    0.0    6.5    6.0    0.0    2.6    3.6    92.2   0.0    0.0
ES-5 (interindividual)    6.1    1.8    34.7   30.0   17.4   20.1   2.9    18.0   6.4    12.6

Subject                   11     12     13     14     15     16     17     18     19     20
ES-2 (intraindividual)    95.5   77.5   80.9   63.4   76.4   81.9   91.6   99.1   50.2   77.0
ES-5 (intraindividual)    92.1   79.1   87.4   65.5   68.0   90.3   96.9   98.4   67.2   84.0
ES-2 (interindividual)    0.0    27.5   0.0    34.6   0.0    25.4   1.8    13.9   1.0    7.9
ES-5 (interindividual)    51.4   9.7    4.3    1.7    15.8   3.4    4.4    54.8   36.7   7.0

TABLE VII
COMPARISON OF MEANS (μ) AND STANDARD DEVIATIONS (σ) FOR ES-2 AND ES-5 GROUPED BY AGE OF PARTICIPANTS. VALUES ARE GIVEN IN AUs PER MINUTE. – INDICATES THAT THE AU WAS NOT SHOWN IN THIS ES.

Sub-sample          AU   μ ES-2   σ ES-2   μ ES-5   σ ES-5
Total sample        6    0.12     0.30     0.44     0.72
                    34   –        –        0.05     0.15
Young participants  23   –        –        0.08     0.17
Old participants    6    0.24     0.40     0.79     0.88
                    12   0.28     0.38     1.13     1.13
                    18   0.05     0.15     0.28     0.37

C. Facial Action Coding System Results

The minimum amount of muscular activity shown during the experiment was 15.43 AUs per minute; the maximum value was 75.54. Eight AUs did not occur in this sample: AU 8, 21, 29, 31, 35, 36, 37, and 38. All significant results are reported for a two-tailed test. Average values and standard deviations can be found in Table VII. When comparing ES-2 and ES-5 in general, AU 6 and AU 34 differ significantly (p = 0.047; p = 0.076) between these two diverging emotional sequences. The younger participants showed significantly less muscular activity corresponding to AU 23 in ES-2 than in ES-5 (p = 0.068). For the elderly sub-sample, ES-2 and ES-5 could be significantly separated by looking at AU 6, 12, and 18 (p = 0.082; p = 0.053; p = 0.041).

A smile is described by AU 12, the “lip corner puller”: the lip corners are pulled upwards towards the cheeks when this AU is shown. For the elderly people in this sample, this AU can separate ES-2 from ES-5 by occurring more often in ES-5. It can be assumed that people show a smile-like muscle movement that goes along with frustration, annoyance, or astonishment. AU 6 is not significant in these cases, which would otherwise indicate a “Duchenne smile”. The cheek raiser and lid compressor, AU 6, may also be used when being very concentrated, because people then often focus on a certain point (on the computer screen). ES-2 and ES-5 can be discriminated by this AU when looking at the whole sample and at the older participants. Another AU that is relevant to distinguish between ES-2 and ES-5 is AU 34. This AU describes a muscle activity in which people puff out their cheeks. We hypothesised that this may go along with high concentration or even excessive demands, due to the fact that it is only shown in situations with negative valence. For the younger population only one AU qualified for discrimination: AU 23, the lip tightener. This AU is only shown in ES-5, so it is likely that this AU also represents a negative emotional state like frustration. Unfortunately, the relevance of single AUs is not considered in current coding systems like EmFACS [19].



Therefore, nearly no literature on the basis of single AUs is accessible, but the present study demonstrates that the specificity of single AUs in HMI is highly relevant with respect to facial emotion recognition.

V. CONCLUSION AND OUTLOOK

By analysing the EmoRec-Woz I corpus we obtained an indication that audio and biopsychological features are related. In particular, the individual recognition rates of the experimental sequences (cf. Table V and Table VI) support this assumption. Both experiments show the need for personalisation with high significance. For audio, the tendency is weaker; in contrast, this modality has the advantage of suitable correctness in interindividual classification. This can be used to obtain first results while the system is adapting itself to the specificity of the subject. Moreover, this adaptation can technically be realised more easily for audio (cf. [27]) than for biopsychology. Specifically, in the calibration phase the classification result should be based on interindividual audio systems. In the meantime, suitable material for the biopsychological classifier can be collected and used to calibrate the biopsychological systems. The output of the interindividual classifiers is used to label the biopsychological samples (the calibration material) with emotional marks. More generally, at the beginning of an interaction the classification should be based on a modality that works interindividually, and afterwards it can be switched to methods that work intraindividually. This can also be supported by FACS, which so far has the drawback of manual annotation. In particular, audio and FACS features depend on the expressivity of the subject. Thus, the third modality of biopsychological signals is an additional and informative feature.
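The proposed personalisation strategy can be summarised in a short sketch; the class, its methods, the confidence threshold, and the calibration set size are hypothetical and only illustrate the switching logic described above.

    # Hypothetical sketch of the proposed strategy: start with an interindividual
    # audio classifier, use its confident outputs to label biopsychological samples,
    # then switch to the calibrated intraindividual model.
    class PersonalisedEmotionRecogniser:
        def __init__(self, audio_clf, bio_clf, confidence=0.8, min_samples=100):
            self.audio_clf = audio_clf        # pre-trained, interindividual
            self.bio_clf = bio_clf            # to be calibrated per user
            self.confidence = confidence      # assumed threshold for "clearly detected"
            self.min_samples = min_samples    # assumed calibration set size
            self.calibration = []             # (bio_features, pseudo_label) pairs
            self.calibrated = False

        def classify(self, audio_feats, bio_feats):
            if self.calibrated:
                return self.bio_clf.predict(bio_feats)           # intraindividual path
            label, conf = self.audio_clf.predict_with_confidence(audio_feats)
            if conf >= self.confidence:                          # collect clearly detected material
                self.calibration.append((bio_feats, label))
            if len(self.calibration) >= self.min_samples:
                features, labels = zip(*self.calibration)
                self.bio_clf.fit(features, labels)               # calibrate to this user
                self.calibrated = True
            return label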

Based on the presented analyses and the aforementioned statements, we will investigate the influence of fusion, i.e., the combination of different features and classifier results, on emotion recognition in the personalisation process. Especially in cognitive technological systems with companion-like characteristics [2], we assume to have enough material to evolve such systems and to test them over a longer period. For this, the modalities have to be integrated into daily life in such a way that the user is not annoyed by them, especially in critical environments, e.g. hospitals or battlefields. For instance, biopsychological sensors and microphones can be embedded in clothes and smart cameras in helmets.

VI. ACKNOWLEDGEMENT

We acknowledge continued support by the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG). We also acknowledge the DFG for financing our computing cluster used for parts of this work.

REFERENCES

[1] E. Hudlicka, “To feel or not to feel: The role of affect in human-computer interaction,” International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 1–32, 2003.
[2] A. Wendemuth and S. Biundo, “A companion technology for cognitive technical systems,” in COST 2012 Conference “Cross Modal Analysis of Verbal and Nonverbal Communication”, Dresden, Germany, 2011.
[3] K. R. Scherer, “Appraisal considered as a process of multilevel sequential checking,” in Appraisal Processes in Emotion: Theory, Methods, Research, pp. 92–120, 2001.
[4] M. Holodynski and W. Friedlmeier, Development of Emotions and Emotion Regulation. Springer, 2006.
[5] S. Walter, D. Hrabal, A. Scheck, H. Kessler, G. Bertrand, F. Nothdurft, W. Minker, and H. Traue, “Individual emotional profiles in wizard-of-oz-experiments,” in ACM International Conference Proceeding Series, 2010.
[6] R. W. Picard, Affective Computing. Cambridge, MA: MIT Press, 2000.
[7] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[8] J. Kim and E. André, “Emotion recognition based on physiological changes in music listening,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2067–2083, 2008.
[9] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[10] H. Gunes and M. Pantic, “Automatic, dimensional and continuous emotion recognition,” International Journal of Synthetic Emotions, vol. 1, no. 1, pp. 68–99, 2010.
[11] S. Walter, S. Scherer, M. Schels, M. Glodek, D. Hrabal, M. Schmidt, R. Böck, K. Limbrecht, H. Traue, and F. Schwenker, “Multimodal emotion classification in naturalistic user behavior,” in Proc. of the 14th International Conference on Human-Computer Interaction, Orlando, USA, 2011.
[12] P. Lang, M. Bradley, and B. Cuthbert, “International affective picture system (IAPS): Affective ratings of pictures and instruction manual,” University of Florida, Gainesville, FL, Technical Report A-8, 2008.
[13] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: Resources, features, and methods,” Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006.
[14] A. Mehrabian, “Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament,” Current Psychology, vol. 14, no. 4, pp. 261–292, 1996.
[15] B. Vlasenko and A. Wendemuth, “Processing affected speech within human machine interaction,” in INTERSPEECH-2009, vol. 3, Brighton, 2009, pp. 2039–2042.
[16] C. E. Izard, The Psychology of Emotions, 1st ed. Springer, 2004.
[17] P. Ekman and W. V. Friesen, “Facial Action Coding System: A technique for the measurement of facial movement,” Consulting Psychologists Press, 1978.
[18] P. Ekman, Basic Emotions. John Wiley & Sons, Ltd, 2005, pp. 45–60.
[19] P. Ekman and W. V. Friesen, “EmFACS,” unpublished document, San Francisco, USA, 1982.
[20] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assessment manikin and the semantic differential,” Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[21] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (version 3.4). Cambridge University Engineering Department, 2009.
[22] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden Markov models,” Speech Communication, vol. 41, no. 4, pp. 603–623, 2003.
[23] R. Böck, D. Hübner, and A. Wendemuth, “Determining optimal signal features and parameters for HMM-based emotion classification,” in Proc. of the 15th IEEE Mediterranean Electrotechnical Conference. IEEE, 2010, pp. 1586–1590.
[24] J. Cohen, “A power primer,” Psychological Bulletin, vol. 112, no. 1, pp. 155–159, 1992.
[25] D. Freedman, R. Pisani, and R. Purves, Statistics. Norton, 2007.
[26] B. Schuller, B. Vlasenko, R. Minguez, G. Rigoll, and A. Wendemuth, “Comparing one and two-stage acoustic modeling in the recognition of emotion in speech,” in 2007 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2007, pp. 596–600.
[27] B. Vlasenko, D. Philippou-Hübner, D. Prylipko, R. Böck, I. Siegert, and A. Wendemuth, “Vowels formants analysis allows straightforward detection of high arousal emotions,” in 2011 IEEE International Conference on Multimedia and Expo (ICME), 2011.
