The More the Merrier: Analysing the Affect of a Group of People in Images

Abhinav Dhall1,2, Jyoti Joshi1, Karan Sikka3, Roland Goecke1,2 and Nicu Sebe4

1 HCC Lab, Vision & Sensing Group, University of Canberra, Australia
2 IHCC Group, RSCS, Australian National University, Australia
3 University of California San Diego, USA
4 University of Trento, Italy

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract— The recent advancement of social media has given users a platform to socially engage and interact with a global population. With millions of images being uploaded onto social media platforms, there is an increasing interest in inferring the emotion and mood displayed by a group of people in images. Automatic affect analysis research has come a long way but has traditionally focussed on a single subject in a scene. In this paper, we study the problem of inferring the emotion of a group of people in an image. Group affect has wide applications in retrieval, advertisement, content recommendation and security. The contributions of the paper are: 1) a novel emotion labelled database of groups of people in images; 2) a Multiple Kernel Learning based hybrid affect inference model; 3) a scene context based affect inference model; 4) a user survey to better understand the attributes that affect the perception of the affect of a group of people in an image. The detailed experimental validation provides a rich baseline for the proposed database.

I. INTRODUCTION

Groups are emotional entities and a rich source of varied manifestations of affect. The literature in social psychology suggests that group emotion can be conceptualised in different ways and is best represented by pairing the top-down approach (emotion emerging at the group level and then followed by the individual participants of the group) with the bottom-up approach (the overall emotion of the group constructed from the uniqueness of individual members' emotion expression) [1], [2]. This paper follows the group-as-a-whole perspective to capture the elusive emotions arising from a group, broadly focussing on positive and negative affect in images. Automatic analysis of a group of people is an important problem as it has a wide variety of applications, such as image retrieval, early event prediction, surveillance and image set visualisation, among others. The model proposed in this paper encompasses both scene information, to capture the top-down component, and individuals' facial features, to capture the bottom-up component.

Affective computing has seen much progress in recent years, especially in automatic emotion analysis and understanding via verbal and non-verbal cues of an individual [3]. However, until recently, relatively little research has examined 'group' emotion in images. To advance affective computing research, it is indeed of interest to understand and model the (perceived) affect exhibited by a group of people in images. The initial impediment in pursuing this research is obtaining the data, which should contain multiple participants exhibiting diverse emotions in real-world situations.

Fig. 1. Images of a group of people at a social event. The upper, middle and lower rows contain images where the group displays a positive, neutral and negative affect, respectively.

Furthermore, a framework is required that can model the perceived affect of a group of people in an image.

Advanced and inexpensive sensor technology has resulted in exponential growth in the number of images and videos being uploaded to the internet. This large pool of data enables us to explore images containing multiple participants (e.g. Figure 1). Consider an image from a social event, such as a birthday party, or images of a group of people watching a football game. For automatically organising, visualising and retrieving these images, there are various cues, such as colour, faces and meta-data, that can be used. The presence of a group of people, posing for or being captured in a photograph, provides an opportunity for analysing the affect perceived by a viewer. This paper describes a novel database and framework for affect classification of a group of people in an image. Note that, in this paper, we are interested in inferring the group affect perceived by the viewer of the image. The images in the proposed database have been collected from the WWW. No self-evaluation of affect is available from the members of the groups in the images.


The terms 'affect' and 'emotion' are used interchangeably throughout this paper, as both are semantically similar terms for the general representation of an individual's feeling response [1].

In affect analysis, many interesting approaches [4], [5], [6], [7], [8] have been proposed for different problems, ranging from expression / emotion inference, e.g. [9], [5], to medical applications such as pain detection [10] and entertainment [6]. However, apart from [8], these methods only deal with a single person. Prior work also exists on analysing the non-verbal behaviour of groups of people in videos of social scenarios containing multiple persons, such as meetings [11]. However, the problem this paper tackles is different: we are interested in images of people at social events, whereas [11] looks at groups of people in videos to understand their interaction.

The key contributions of this study are:
1) A novel hybrid framework for affect inference of a group of people in an image.
2) A labelled database containing images of groups of people in a wide variety of social events.
3) Multiple Kernel Learning (MKL) based fusion for adding local and scene context.
4) A user survey to understand the factors that affect the perception of affect displayed by a group.

II. RELATED WORK

Recently, the study of a group of people in an image or a video has received much attention in the domain of computer vision. Generally, these methods can be broadly divided into two categories: a) Bottom-up methods: the subjects' attributes are used to infer information at the group level [12], [13], [14]; b) Top-down methods: the group / sub-group information is used as a prior for the inference of subject level attributes [15], [16], [17].

First, we look at the bottom-up methods. Recent work in crowd tracking [12] is based on trajectories constructed from the movement of people. Ge et al. [12] propose a hierarchical clustering algorithm, which detects sub-groups in crowd video clips. Cameras were installed at the MIT campus for inferring the mood of passersby [13]. In [13], the scene level happiness is an average over the smiles of multiple people. However, in reality, group mood does not follow an averaging model [2].

In [8], [18], we proposed models and a database (HAPPEI) for inferring the happiness mood intensities of a group of people in images. The hypothesis is that there are certain attributes which affect the perception of the happiness of a group of people in an image. We proposed group expression models based on topic modelling and manually defined attributes. Inspired by Gallagher et al. [15], we represented a group as a minimum spanning tree, in which the faces are the vertices and the edge weights encode the distance between two faces. The context attributes are based on a survey conducted in [18]; further analysis and details of the survey are given in Section IV of this paper. Here, we extend the data and method from positive affect only [18] to a wider gamut of emotion (Positive-Neutral-Negative) of groups of people. Furthermore, as compared to [18], where only faces were analysed, in this work the effect of the background / scene is analysed as well in a fusion framework.

In another interesting bottom-up method, Murillo et al. [19] proposed group classification for recognising urban tribes ('informal club', 'beach party', 'hipsters', etc.). They used low-level features, such as colour histograms, and high-level features, such as person attributes, to learn a Bag of Words (BoW) based classifier. Hipster Wars [20] proposes a method based on analysing the clothes of group members to determine their social membership type.

Gallagher et al. [15] proposed a top-down approach based on group-derived contextual features for the age and gender inference of group members. Images were downloaded from the internet [15] and the experimental results showed the positive effect of using group information. In another top-down approach, Wang et al. [21] model the social relationships between people standing together in a group to aid recognition. The social relationships are inferred in unseen images by learning them from weakly labelled images. The authors learn a graphical model based on social relationships such as 'father-child' and 'mother-child', and social-relationship features such as relative height, height difference and face ratio. Lu et al. [16] proposed a metric learning based method for inferring kinship relationships in images. In the object detection and recognition work by Torralba and Sinha [22], the context information of the scene and its relationship with the objects is described. Stone et al. [17] proposed conditional random field based social relationship modelling between Facebook contacts for the problem of face recognition.

Inspired by the works mentioned above, we propose a hybrid approach, which combines top-down and bottom-up components. The rest of the paper is arranged as follows: the group database collection process is detailed in Section III. A survey to understand the relevant attributes is described in Section IV. Section V discusses the group affect inference pipeline: the Action Unit based representation is discussed in Section V-A, the low-level feature based face representation is described in Section V-B, and scene analysis for adding context to the pipeline is described in Section V-C. Experiments and implementation details are described in Section VI. Conclusions and a discussion of future work are presented in Section VII.

III. THE GROUP AFFECT DATABASE

Over the years, several affect databases have been released. The earlier databases, such as Cohn-Kanade [23] or MMI [24], have led to significant contributions to the field. Each has a single subject posing a specific facial expression in lab-controlled settings (e.g. plain background, controlled illumination, no occlusion). Lately, databases capturing situations arising in real-world environments have been released (GENKI [5], the Gallagher database [15], Acted Facial Expressions in the Wild (AFEW) [14], AM-FED [25]). GENKI [5] contains smiling/non-smiling pictures of celebrities collected from the internet. The 'in the wild' databases, the Gallagher database [15] and AFEW [14], contain images/videos collected from the internet and movies, respectively. The AFEW database does contain a few video clip samples with multiple subjects. The Gallagher database contains images of groups of people, which have been labelled for age and gender attributes. In [8], we proposed the HAPPEI database, which contains images of groups of people displaying happy expressions only. To tackle the current problem, we collected a new database that was labelled for the perceived affect of a group of people in an image.

The proposed database was acquired by first searching Flickr.com and Google Images for images related to keywords1 which describe groups and events. Some positive affect images were taken from our HAPPEI database [8]. Faces were then detected [26] in the downloaded images and images containing fewer than two faces were rejected. Given the nature of the images in the database, the subjects of a group and their expressions can be quite heterogeneous. Ideally, one may like to use automatic / semi-automatic methods, such as topic discovery algorithms or the parsing of related text, for extracting labels. However, given the high intra-class variance due to different scenes and subjects, human annotator labels are used for the database. This is hence a weakly-labelled problem. We would like to mention that for the AFEW database, where the initial labels were generated by parsing subtitles, a final pruning step was performed by the human labellers, while for the HAPPEI database, the labels were manually annotated by the labellers. In this work, the emotion of a group in an image is labelled as Positive, Negative or Neutral. These labels closely resemble the valence axis of the Valence-Arousal emotion scale. We used three human annotators for labelling the dataset. Any images without label consensus were removed. The database contains 504 images. Sample images from the proposed Group Affect Database are shown in Figure 1.

1 The sample keywords were: Tahrir Square, London Protest, Brazil Football Fans, Excited People, Happy People, Humanitarian Aid, Delhi Protest, Gaza Protest, Party Friends, Police Brutality, Celebration, etc.
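To make the labelling step concrete, the following is a minimal sketch of consensus filtering over the three annotators' labels. The dictionary layout, the image names and the `require_unanimous` switch are illustrative assumptions; the paper only states that images without label consensus were removed.

```python
from collections import Counter

# Hypothetical annotation store: image name -> the three annotators' labels.
annotations = {
    "img_001.jpg": ["Positive", "Positive", "Positive"],
    "img_002.jpg": ["Neutral", "Negative", "Positive"],   # no agreement -> dropped
    "img_003.jpg": ["Negative", "Negative", "Neutral"],   # majority but not unanimous
}

def consensus_label(labels, require_unanimous=True):
    """Return the agreed label, or None if the annotators disagree."""
    label, votes = Counter(labels).most_common(1)[0]
    if require_unanimous:
        return label if votes == len(labels) else None
    return label if votes > len(labels) // 2 else None

dataset = {img: consensus_label(lbls) for img, lbls in annotations.items()}
dataset = {img: lbl for img, lbl in dataset.items() if lbl is not None}
print(dataset)   # only images with an agreed label are kept
```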

IV. SURVEY

In order to understand the attributes which affect the perception of the affective state of a group, we conducted a user study [18]. Two sets of surveys were developed. In the first set, subjects were asked to compare two images for their apparent affect and pick the one with the higher positive affect. Further, they were asked various questions about the attributes / reasons which made them choose a specific image / group out of the two. In the second set, only a single image was shown and questions were asked about it.

A total of 149 subjects participated in the survey (94 male and 55 female subjects). Various questions were asked, e.g. 'How would you describe the expression of the group?', 'Was your choice motivated by: (multiple answers acceptable)', 'How would you describe the MOOD of the group?', 'What are the dominating attributes/characteristics of the leader(s) in the group that affect the group's mood?' and 'Any other reason for your choice?'. Figure 2 shows the average of the responses to the questions. It is evident that the location of the subjects, the number of smiles, etc. play a dominating role in the perception of the mood of a group. Figure 3 shows the dominating words in the responses to 'What is the dominating attribute / characteristic of the leader(s) in the group that affects your perception of the group's mood?' and 'Any other reason for your choice?'. The dominating words about the salient participants in a group are smiling, smiles, attractive, eyes, beauty, etc.

Fig. 2. Summary of responses in the survey.

Fig. 3. Most frequently occurring responses in the survey regarding (a) any reason other than the questions asked and (b) the most salient subject.

Fig. 4. Analysis of the subject responses for one of the survey images.

For the image analysed in Figure 4, in response to 'Any other reason for your choice?', 50% of the subjects mentioned the pleasant scene as the reason for their affect rating. In other examples, subjects mentioned attributes such as age, gender and attractiveness as factors that affect their perception of the affect of a group of people in an image. Based on these observations, we propose a hybrid framework, which models local features (face analysis, Sections V-A and V-B) and global features (scene descriptor, Section V-C).

V. PIPELINE

The proposed framework for inferring affect is based on multi-modal fusion using MKL. Figure 5 describes the flow of the proposed method. Below, we discuss each sub-system and its contribution to the overall framework.

Fig. 5. The proposed group affect inference pipeline.

A. Action Unit Based Face Representation

Facial Action Units (AUs) describe the activation of facial muscles when there is a change in the facial expression. Facial AUs are one of the leading and most widely used representations for facial expression analysis and have been used extensively for inferring affect. Given an aligned face of a subject in a group, we compute AU features using the CERT toolbox [27]. CERT is based on the computation of Gabor filter based features and SVM based classification, and is used extensively by the face analysis community for its real-time and stable performance. Furthermore, to model a group, we learn a BoW representation, in which each member's face is represented as a word and the group is represented as a document. The BoW representation has been used extensively in computer vision (e.g. [19]); it represents a document as a histogram of unordered word frequencies. We refer to the BoW formulation where the AU features are the words as BoW_AU. The AUs here capture the attributes mentioned in the survey, such as smiling, happy, etc.
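As a rough illustration of the bag-of-words formulation described above, the sketch below builds a codebook over per-face AU score vectors and represents each image (the document) as a histogram of face words. The random data, the 20-dimensional AU vectors and the use of k-means for the codebook are illustrative assumptions; the paper does not specify the codebook learning method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Assumed inputs: one AU-score vector per detected face (e.g. CERT outputs),
# grouped by image; random data stands in for real AU features here.
train_faces = rng.normal(size=(500, 20))                 # 500 faces, 20 AU channels
groups = [rng.normal(size=(n, 20)) for n in (3, 5, 8)]   # faces per image

# Learn a codebook of facial "words" from all training faces.
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(train_faces)

def bow_histogram(faces, codebook):
    """Represent a group (document) as a normalised histogram of face words."""
    words = codebook.predict(faces)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

bow_au = np.vstack([bow_histogram(g, codebook) for g in groups])
print(bow_au.shape)   # (number of images, 64)
```

A second codebook learnt on the low-level face descriptors of Section V-B yields BoW_LL in exactly the same way.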

B. Low-level Features

Automatic AU detection is an open problem. The presence of varied backgrounds, head pose / movements, occlusions, etc. in challenging conditions makes the task even more difficult. Based on the survey (Section IV), along with facial expressions, there are several other facial attributes, such as age, attractiveness, gender and the presence of occlusions (e.g. beard, glasses), that affect the perception of the affect of a group. Based on these two arguments, we perform feature augmentation by computing low-level features on the aligned faces. Pyramid of Histogram of Gradients (PHOG) [28] and Local Phase Quantization (LPQ) [29] descriptors are computed over each aligned face. PHOG is computed by applying edge detection to an image, followed by histogram computation in a pyramid fashion. For computing LPQ, local binary patterns are computed over the coefficients extracted by applying a short-term Fourier transform to an image. The combination of these two features is robust to scale and illumination changes [30].
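A minimal sketch of a PHOG-style descriptor on a grayscale face crop is given below, using the parameters reported in Section VI (16 bins, an orientation range of [0-180] and 3 pyramid levels). Gradient magnitudes stand in for an explicit edge detector, so this is an approximation of the descriptor in [28], not the exact implementation used in the paper.

```python
import numpy as np

def phog_like(gray, bins=16, levels=3):
    """Orientation histograms of gradient magnitude, pooled over a spatial pyramid."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)                        # image gradients
    mag = np.hypot(gx, gy)                            # edge strength
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned orientation, [0, 180)

    feats = []
    h, w = gray.shape
    for level in range(levels + 1):                   # pyramid levels 0..3
        cells = 2 ** level
        ys = np.linspace(0, h, cells + 1, dtype=int)
        xs = np.linspace(0, w, cells + 1, dtype=int)
        for i in range(cells):
            for j in range(cells):
                m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                a = ang[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
                feats.append(hist)
    feat = np.concatenate(feats).astype(float)
    return feat / max(np.linalg.norm(feat), 1e-12)

face = np.random.rand(128, 128)     # stands in for an aligned 128 x 128 face crop
print(phog_like(face).shape)        # (16 * (1 + 4 + 16 + 64),) = (1360,)
```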

Similar to BoW_AU, a BoW representation is learnt on the low-level features. This feature is referred to as BoW_LL. Furthermore, feature fusion is performed (details in Section VI) between BoW_AU and BoW_LL. This fusion based representation can also be viewed as adding implicit information (low-level based attributes) to manually defined attributes (the AU based representation). This is similar to the automatic action recognition approach defined in [31].
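Complementing the PHOG sketch above, a simplified LPQ-style computation is sketched below: short-term Fourier transform responses at four low frequencies are phase-quantised into 8-bit codes and histogrammed, and the resulting per-face vectors can feed the BoW_LL codebook. The window size and the omission of the decorrelation step of [29] are simplifying assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_like(gray, win=7):
    """Simplified LPQ: quantise STFT phase at four low frequencies into 8-bit codes."""
    gray = gray.astype(float)
    x = np.arange(win) - (win - 1) / 2.0
    a = 1.0 / win                                # lowest non-zero frequency
    w0 = np.ones(win)                            # all-pass component
    w1 = np.exp(-2j * np.pi * a * x)             # complex exponential at frequency a

    def stft(row_f, col_f):
        # Separable 2-D convolution: filter along columns, then along rows.
        tmp = convolve2d(gray, col_f[:, None], mode='valid')
        return convolve2d(tmp, row_f[None, :], mode='valid')

    # Responses at frequencies (a, 0), (0, a), (a, a) and (a, -a).
    F = [stft(w1, w0), stft(w0, w1), stft(w1, w1), stft(w1, np.conj(w1))]

    codes = np.zeros(F[0].shape, dtype=np.int32)
    for k, f in enumerate(F):                    # sign of real and imaginary parts
        codes += (np.real(f) >= 0).astype(np.int32) << (2 * k)
        codes += (np.imag(f) >= 0).astype(np.int32) << (2 * k + 1)

    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

face = np.random.rand(128, 128)
print(lpq_like(face).shape)                      # (256,)
```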

C. Scene Context

As deduced from the survey, scene information plays a vital role in the perception of group affect. We investigate the usefulness of scene analysis features for adding global information to our model. Two widely used scene analysis descriptors are compared, GIST [32] and the CENsus TRansform hISTogram (CENTRIST) [33], with respect to the problem of affect analysis of a group of people. GIST and CENTRIST model the scene at a holistic level. We refer to the GIST and CENTRIST descriptor representations as Scene_GIST and Scene_CENTRIST, respectively. The scene feature computes statistics at a global level: it takes into consideration not only the background but also information that may define the situation, such as clothes.
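For the scene stream, a simplified holistic CENTRIST-style descriptor can be sketched as follows: an 8-bit census transform followed by a 256-bin histogram. The comparison convention and the absence of the spatial pyramid and PCA stages of the full CENTRIST descriptor [33] are simplifications for illustration.

```python
import numpy as np

def census_transform(gray):
    """8-bit census transform: compare each interior pixel with its eight neighbours."""
    g = gray.astype(float)
    c = g[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1),
              ( 0, -1),          ( 0, 1),
              ( 1, -1), ( 1, 0), ( 1, 1)]
    ct = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        ct += (c >= nb).astype(np.int32) << bit
    return ct                                     # codes in [0, 255]

def centrist_like(gray):
    """Simplified holistic CENTRIST: a normalised 256-bin histogram of CT codes."""
    ct = census_transform(gray)
    hist = np.bincount(ct.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

scene = np.random.rand(240, 320)    # stands in for a grayscale group photograph
print(centrist_like(scene).shape)   # (256,)
```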

Fusion is performed between BoW_AU, BoW_LL and the scene features. This results in a hybrid framework, where the face analysis subsystem represents a bottom-up approach and the scene context analysis represents a top-down approach. This model, in fact, takes inspiration from Moshe Bar's scene context model [34], where two concurrent streams are used for scene processing: a low-resolution holistic representation and a detailed object level representation. In this paper, the low-resolution representation corresponds to the scene descriptor analysis; the scene descriptors are computed quickly and give a sense of the context, which is also confirmed by the human studies conducted by Fei-Fei et al. [35]. The second stream deals with salient objects, which in our problem is the face analysis framework.

D. Multiple Kernel Learning for Fusion

We wish to fuse BoW_AU, BoW_LL, Scene_GIST and Scene_CENTRIST. A trivial way is to perform feature fusion. However, there is no guarantee that the complementary information will be captured; the performance of a multi-modal system may not increase despite the increase in feature dimensionality. One obvious option is to apply a feature selection / dimensionality reduction method. Decision level fusion could also be promising. Recently, [10] successfully used MKL for audio-video emotion recognition. MKL computes a linear combination of base kernels. We use the MKL method of [36], as it poses the MKL problem as a convex optimisation problem, which guarantees an optimal solution.

We represent each modality using a single kernel. The notation below is similar to [10]. Given a training set with $N$ group images, let us denote the $M$ feature sets (modalities) as $X_m$, $m \in \{1, 2, \ldots, M\}$. The labels are represented as $y_i \in \{-1, 1\}$ for $i = 1, \ldots, N$. For each modality, we compute a Histogram Intersection Kernel (HIK), which can be defined as $k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{n} \min(x_{i,n}, x_{j,n})$, where $n$ runs over the feature dimensions. The dual formulation of the SVM optimisation problem can be written as:

$$\max_{\alpha, \beta} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (1)$$

$$\text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \qquad K \succeq 0$$


Feature           Positive   Neutral   Negative   Final
BoW_AU             70.93      33.33     37.93     50.43
BoW_LL             76.74      56.66     06.90     50.98
Scene_GIST         52.32      38.33     31.03     42.16
Scene_CENTRIST     50.00      45.00     39.65     45.58

TABLE I: Classification accuracies (in %) on the three-class task for individual modalities.

where $K$ is defined as the convex combination of all feature kernels:

$$K = \sum_{m=1}^{M} \beta_m k_m \qquad (2)$$

$$\text{s.t.} \quad \sum_{m=1}^{M} \beta_m = 1, \qquad \beta_m \ge 0 \;\; \forall m$$

Grid search is used to find the parameter C. α and β areautomatically learnt by MKL [36].
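The following sketch illustrates the fusion step of Eqs. (1)-(2): one histogram intersection kernel per modality, combined with weights beta and fed to an SVM with a precomputed kernel. The kernel weights are fixed placeholders and the data are random; in the paper, alpha and beta are learnt jointly by the MKL solver of [36], and scikit-learn's SVC merely stands in for LibSVM.

```python
import numpy as np
from sklearn.svm import SVC

def hik(A, B):
    """Histogram intersection kernel between the rows of A (n x d) and B (m x d)."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

rng = np.random.default_rng(0)
n_train, n_test = 60, 20

# Assumed inputs: one histogram-style feature matrix per modality
# (BoW_AU, BoW_LL, scene descriptor); random non-negative data stands in here.
dims = (64, 128, 256)
train_feats = [np.abs(rng.normal(size=(n_train, d))) for d in dims]
test_feats = [np.abs(rng.normal(size=(n_test, d))) for d in dims]
y_train = rng.integers(0, 2, n_train) * 2 - 1       # labels in {-1, +1}

# Fixed illustrative kernel weights; the MKL solver of [36] would learn these.
beta = np.array([0.4, 0.4, 0.2])

K_train = sum(b * hik(X, X) for b, X in zip(beta, train_feats))
K_test = sum(b * hik(Xt, X) for b, X, Xt in zip(beta, train_feats, test_feats))

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print(clf.predict(K_test)[:5])
```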

VI. EXPERIMENTAL ANALYSIS

Face processing pipeline: Faces and fiducial points are detected using the publicly available2 MoPS library. A 146-part model is used for fitting. A total of 1756 faces were detected in 417 images. The data are divided into two sets. An affine warp is computed on the detected faces and the aligned faces are downsized to 128 × 128 pixels. For the PHOG descriptor3, the bin size is 16, the orientation range is [0-180] and the pyramid level is 3. For the LPQ descriptor, rotation invariant LPQ4 is computed. Feature fusion is performed using LPQ and PHOG. To learn the dictionaries for BoW_LL and BoW_AU, a wide range of dictionary sizes, [8-128], was experimented with. The final dictionary sizes for BoW_AU and BoW_LL are 64 and 128, respectively.
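A minimal sketch of the alignment step is given below: an affine (similarity) transform is estimated from detected fiducial points to a canonical 128 x 128 template and applied to the face. The specific landmark coordinates and the use of OpenCV are illustrative assumptions; the paper uses MoPS detections and an affine warp but does not detail the implementation.

```python
import numpy as np
import cv2

def align_face(image, src_pts, dst_pts, size=128):
    """Warp a face so that detected fiducial points match a canonical template."""
    src = np.asarray(src_pts, dtype=np.float32)
    dst = np.asarray(dst_pts, dtype=np.float32)
    # Least-squares similarity (partial affine) fit from detected points to template.
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(image, M, (size, size), flags=cv2.INTER_LINEAR)

# Illustrative values only: eye corners and mouth centre in the source image,
# and where they should land inside the 128 x 128 aligned crop.
src_pts = [(210, 180), (260, 182), (236, 240)]
dst_pts = [(44, 52), (84, 52), (64, 96)]
face_img = np.zeros((480, 640, 3), dtype=np.uint8)   # stands in for a real photo
aligned = align_face(face_img, src_pts, dst_pts)
print(aligned.shape)                                 # (128, 128, 3)
```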

Scene descriptors: The publicly available GIST implementation5 is used with its default parameters: orientations per scale = [8 8 8 8]; number of blocks = 4. Similarly, for the CENTRIST descriptor, the publicly available implementation6 is used.

2 http://www.ics.uci.edu/~xzhu/face/
3 http://www.robots.ox.ac.uk/~vgg/research/caltech/phog.html
4 http://www.cse.oulu.fi/CMV/Downloads/LPQMatlab
5 http://people.csail.mit.edu/torralba/code/spatialenvelope/
6 https://github.com/sometimesfood/spact-matlab

The LibSVM library [37] with HIK is used for learning classifiers for the individual modalities. The cost parameters are set empirically using grid search. Table I shows the classification accuracies for the individual descriptors: BoW_AU, BoW_LL, Scene_GIST and Scene_CENTRIST. It is interesting to see that the high-level features based on AUs perform similarly to the low-level feature combination. The classification accuracy for the Negative class is lower than for the other two classes. On analysing the confusion matrices, it was found that the numbers of Negative samples incorrectly classified as Positive and as Neutral are almost the same. Further investigation revealed one possible reason. For collecting the Negative images from the internet, keywords such as 'protest', 'violence' and 'unhappy' were used. Generally, in such scenarios, subjects are not posing directly in front of the camera, and their body gestures may interfere with face visibility. For example, in protest images, members of a group often raise their arms and may hold placards, which occludes faces at times. We observed that there are more non-frontal faces in the Negative class. This leads to errors in the face alignment step, and these errors propagate through the system. Chew et al. [38] argue that a small error in face fitting can be compensated for by good descriptors; this is probably why BoW_AU's Negative class performs better than that of BoW_LL, as CERT has individual detectors for each AU.

Feature                                   Positive        Neutral         Negative        Final
BoW_LL + BoW_AU + Scene_GIST              63.95           38.33           46.55           51.47
BoW_LL + BoW_AU                           86.04           31.66           20.68           51.47
BoW_LL + BoW_AU + Scene_CENTRIST          51.12           48.33           44.82           48.52
MKL: BoW_LL + BoW_AU + Scene_GIST         82.55 (0.0083)  78.33 (0.7993)  50.00 (0.1924)  67.15
MKL: BoW_LL + BoW_AU + Scene_CENTRIST     83.72 (0.0085)  80.00 (0.7976)  31.03 (0.1938)  67.64

TABLE II: Classification accuracies (in %) when fusing the best performing individual modalities. For MKL, the values in brackets are the learnt kernel weights.

For the fusion of the individual modalities, we performed feature fusion and fusion using MKL. The MKL implementation is publicly available7. Table II shows the classification accuracies. As hypothesised, MKL performs better than feature fusion. The MKL based fusion of BoW_AU, BoW_LL and Scene_CENTRIST performs the best of all the tested configurations. MKL based on BoW_AU and BoW_LL performs worse than the MKL based fusion of BoW_AU, BoW_LL and Scene_CENTRIST; it is to be understood here that, given the 'in the wild' nature of the images, face detection may fail, so we need to add complementary information, which we obtain from the scene level descriptors. The weights learnt for the three kernels are given in Table II.

7 http://www.cse.msu.edu/~bucakser/software.html

VII. CONCLUSIONS

In this paper, a novel framework for analysing the affect of a group of people in an image has been presented. An 'in the wild' image database, labelled with the perceived affect of groups of people, was collected from the internet based on keyword searches. In order to understand the factors that affect the perception of emotion in a group, a user study was conducted. A Multiple Kernel Learning based hybrid affect inference method is proposed. Top-down information is extracted using scene descriptors. Bottom-up information is extracted by analysing the faces of the members of a group. Facial Action Unit based Bag of Words features are augmented with low-level Bag of Words features. The MKL framework considers each feature modality as a separate Histogram Intersection Kernel and performs better than feature fusion methods. This supports the hypothesis of the paper that facial information (bottom-up) and scene information (top-down) together help in inferring the affect conveyed by a group. The framework is backed by the model proposed by Bar [34] and the survey (Section IV). To the best of our knowledge, this work is the first to analyse both positive and negative affect of groups of people in images of social events.

Affect analysis of groups of people is a non-trivial problem. The large heterogeneity due to the uniqueness of individuals' expression of emotion poses one of the biggest challenges in assessing group emotion. Recently, body pose analysis for inferring affect [39] has received much attention. An interesting extension is to fuse body pose information. Along with providing new information about the affect, this may help in scenarios where face detection is not accurate due to pose, occlusion or blur. To deal with non-frontal faces, head pose normalisation methods such as those based on MoPS [40] (as in Section VI, faces and fiducial points are detected using MoPS) can be used. Explicit modelling of attributes such as clothing style and colours and kinship relationships can also give important information about the social event and hence can aid in inferring the affect of a group of people in an image. Currently, the affect classes in the group affect database are Positive, Neutral and Negative. In future, the database will be extended and finer emotion labels and intensities will be added.

VIII. ACKNOWLEDGEMENT

Nicu Sebe has been supported by the MIUR FIRB project S-PATTERNS and by the MIUR Cluster project Active Ageing at Home.

REFERENCES

[1] S. G. Barsade and D. E. Gibson, "Group emotion: A view from top and bottom," in Deborah Gruenfeld, Margaret Neale, and Elizabeth Mannix (Eds.), Research on Managing in Groups and Teams, vol. 1, pp. 81–102, 1998.
[2] J. R. Kelly and S. G. Barsade, "Mood and emotions in small groups and work teams," Organizational Behavior and Human Decision Processes, vol. 86, no. 1, pp. 99–130, 2001.
[3] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[4] M. F. Valstar, I. Patras, and M. Pantic, "Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 76–76, 2005.
[5] J. Whitehill, G. Littlewort, I. R. Fasel, M. S. Bartlett, and J. R. Movellan, "Toward Practical Smile Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2106–2111, 2009.
[6] H. Joho, J. Staiano, N. Sebe, and J. M. Jose, "Looking at the viewer: analysing facial activity to detect personal highlights of multimedia contents," Multimedia Tools and Applications, vol. 51, no. 2, pp. 505–523, 2011.
[7] T. Gehrig and H. K. Ekenel, "Facial action unit detection using kernel partial least squares," in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2092–2099, 2011.
[8] A. Dhall, J. Joshi, I. Radwan, and R. Goecke, "Finding Happiest Moments in a Social Context," in Proceedings of the Asian Conference on Computer Vision (ACCV), pp. 613–626, 2012.
[9] F. Zhou, F. de la Torre, and J. F. Cohn, "Unsupervised Discovery of Facial Events," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2581, 2010.
[10] K. Sikka, K. Dykstra, S. Sathyanarayana, G. Littlewort, and M. Bartlett, "Multiple kernel learning for emotion recognition in the wild," in Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), pp. 517–524, 2013.

[11] D. Gatica-Perez, "Automatic nonverbal analysis of social interaction in small groups: A review," Image and Vision Computing, vol. 27, no. 12, pp. 1775–1787, 2009.
[12] W. Ge, R. T. Collins, and B. Ruback, "Vision-based analysis of small groups in pedestrian crowds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 1003–1016, 2012.
[13] J. Hernandez, M. E. Hoque, W. Drevo, and R. W. Picard, "Mood meter: counting smiles in the wild," in Proceedings of the ACM Conference on Ubiquitous Computing, pp. 301–310, 2012.
[14] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE Multimedia, vol. 19, no. 3, p. 34, 2012.
[15] A. C. Gallagher and T. Chen, "Understanding Images of Groups of People," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 256–263, 2009.
[16] J. Lu, J. Hu, X. Zhou, Y. Shang, Y.-P. Tan, and G. Wang, "Neighborhood repulsed metric learning for kinship verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2594–2601, 2012.
[17] Z. Stone, T. Zickler, and T. Darrell, "Autotagging Facebook: Social network context improves photo annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, 2008.
[18] A. Dhall, R. Goecke, and T. Gedeon, "Automatic group happiness intensity analysis," IEEE Transactions on Affective Computing, 2015.
[19] A. C. Murillo, I. S. Kwak, L. Bourdev, D. J. Kriegman, and S. Belongie, "Urban tribes: Analyzing group photos from a social perspective," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 28–35, 2012.

[20] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg, "Hipster wars: Discovering elements of fashion styles," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 472–488, 2014.
[21] G. Wang, A. C. Gallagher, J. Luo, and D. A. Forsyth, "Seeing people in social context: Recognizing people and social relationships," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 169–182, 2010.
[22] A. Torralba and P. Sinha, "Statistical context priming for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 763–770, 2001.
[23] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), pp. 46–53, 2000.
[24] M. Pantic, M. F. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 5 pp., 2005.
[25] D. McDuff, R. El Kaliouby, T. Senechal, M. Amr, J. F. Cohn, and R. Picard, "Affectiva-MIT facial expression dataset (AM-FED): Naturalistic and spontaneous facial expressions collected in-the-wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 881–888, 2013.
[26] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2879–2886, 2012.

[27] G. Littlewort, J. Whitehill, T. Wu, I. Fasel, M. Frank, J. Movellan, and M. Bartlett, "The computer expression recognition toolbox (CERT)," in Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 298–305, 2011.

[28] A. Bosch, A. Zisserman, and X. Munoz, "Representing Shape with a Spatial Pyramid Kernel," in Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR), pp. 401–408, 2007.
[29] V. Ojansivu and J. Heikkila, "Blur Insensitive Texture Classification Using Local Phase Quantization," in Proceedings of the International Conference on Image and Signal Processing (ICISP), pp. 236–243, 2008.
[30] A. Dhall, A. Asthana, R. Goecke, and T. Gedeon, "Emotion recognition using PHOG and LPQ features," in Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition and Workshops (FERA), pp. 878–883, 2011.
[31] G. Tsai, C. Xu, J. Liu, and B. Kuipers, "Real-time indoor scene understanding using Bayesian filtering with motion cues," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 121–128, 2011.
[32] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[33] J. Wu and J. M. Rehg, "CENTRIST: A visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] M. Bar, "Visual objects in context," Nature Reviews Neuroscience, vol. 5, no. 8, pp. 617–629, 2004.
[35] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, "What do we perceive in a glance of a real-world scene?," Journal of Vision, vol. 7, no. 1, p. 10, 2007.
[36] S. Bucak, R. Jin, and A. K. Jain, "Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition," in Advances in Neural Information Processing Systems, pp. 325–333, 2010.
[37] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm
[38] S. W. Chew, P. Lucey, S. Lucey, J. M. Saragih, J. F. Cohn, I. Matthews, and S. Sridharan, "In the pursuit of effective affective computing: The relationship between features and registration," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 4, pp. 1006–1016, 2012.
[39] A. Kleinsmith and N. Bianchi-Berthouze, "Affective body expression perception and recognition: A survey," IEEE Transactions on Affective Computing, vol. 4, no. 1, pp. 15–33, 2013.
[40] A. Dhall, K. Sikka, G. Littlewort, R. Goecke, and M. Bartlett, "A Discriminative Parts Based Model Approach for Fiducial Points Free and Shape Constrained Head Pose Normalisation In The Wild," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8, 2014.