Neurocomputing 253 (2017) 104–114


Emotion-modulated attention improves expression recognition: A deep learning model

Pablo Barros ∗, German I. Parisi, Cornelius Weber, Stefan Wermter

Department of Informatics, University of Hamburg, Knowledge Technology, Vogt-Koelln-Strasse 30, Hamburg 22527, Germany

∗ Corresponding author. E-mail addresses: [email protected] (P. Barros), [email protected] (G.I. Parisi), [email protected] (C. Weber), [email protected] (S. Wermter).

Article info

Article history:

Received 13 May 2016

Revised 30 December 2016

Accepted 23 January 2017

Available online 11 March 2017

Keywords:

Convolutional neural networks

Deep learning

Multimodal processing

Emotional attention

Emotion recognition

Abstract

Spatial attention in humans and animals involves the visual pathway and the superior colliculus, which integrate multimodal information. Recent research has shown that affective stimuli play an important role in attentional mechanisms, and behavioral studies show that the focus of attention in a given region of the visual field is increased when affective stimuli are present. This work proposes a neurocomputational model that learns to attend to emotional expressions and to modulate emotion recognition. Our model consists of a deep architecture which implements convolutional neural networks to learn the location of emotional expressions in a cluttered scene. We performed a number of experiments for detecting regions of interest based on emotion stimuli, and show that the attention model improves emotion expression recognition when used as an emotional attention modulator. Finally, we analyze the internal representations of the learned neural filters and discuss their role in the performance of our model.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Visual spatial attention allows animals and humans to process relevant environmental stimuli while suppressing irrelevant information. Several brain areas and neural mechanisms have been identified to be involved in the processing of spatial attention during perception [9]. For instance, it has been found that the superior colliculus (SC) – a midbrain structure responsible for the integration of audiovisual stimuli – plays a crucial role in spatial attention, more specifically in the process of target selection and estimating motor consequences such as saccades, i.e. quick eye movements to control the direction of fixation [24]. The integration of audiovisual stimuli in the SC has been extensively investigated from a neurophysiological perspective [30], with different computational approaches modeling the integration of multiple perceptual cues for triggering spatial attention in line with neurobehavioral evidence [4].

Converging findings suggest that selective attention, processed in the SC, is modulated by the affective significance of sensory inputs [32]. More specifically, it has been argued that emotional salience has a direct influence on attention and that neural processes responsible for emotional attention may supplement and even compete with other top-down mechanisms of perception. Behavioral studies have shown that people pay more attention to emotional rather than neutral stimuli and that these effects often are reflexive and involuntary, e.g. visual targets expressing an emotion such as happy or angry are found faster among distractors than targets without such emotional values [10,35]. Phelps et al. [27] showed that the visual detection threshold for low-contrast stimuli is improved if emotional cues are present, thus suggesting the existence of an emotion-driven mechanism to capture spatial attention. Additional studies suggest that in the case of limited attentional resources, emotional information is prioritized over non-affective cues [13,33]. These findings together indicate that emotional salience has a strong role in capturing attention, and that emotional bias is also subject to a set of different non-affective regulatory effects.


From a human–robot interaction (HRI) perspective, different computational models have been proposed for the detection and recognition of emotional expressions [1]. Different cues may carry emotional information, such as face expressions, sound (voice pitch and intensity), and body movements [18]. It has been shown that the combination of these cues increases recognition accuracy [6], suggesting that models for the robust processing of emotional states should feature multimodal properties for the meaningful integration of a set of available perceptual cues. In this context, [12] established six universal emotions that exhibit invariance to cultural and racial factors: “Anger”, “Disgust”, “Fear”, “Happiness”, “Surprise”, and “Sadness”. However, for some HRI applications, emotional states were classified in terms of positive or negative emotions for triggering pro-active robot behaviors [3].

With the advance of deep learning networks, most of the recent work involves the use of neural architectures. The ones with the best performance and generalization apply different classifiers to different descriptors [7,22,26], and there is no consensus on a universal emotion recognition system. Most of these systems are applied to one modality only and reach a good performance for specific tasks. However, all these works rely on face detection models, which are not related to emotion recognition. Although such face detection models work well in controlled scenarios, they show poor performance when applied in complex scenarios [25,31].

As an extension of previous work aiming at emotion recognition [2], in this paper we investigated the modulation mechanisms of emotion-driven attention and implemented a deep neural architecture for the detection of emotional stimuli in a natural scene. We trained our model to distinguish between neutral and happy expressions conveyed by facial features and body movement, and used this information as a modulator to our perception model, improving its recognition capabilities.

We show that although the input is composed of a single image sequence containing both face and body movement cues, the model will autonomously learn separate cue-specific filters. In contrast to traditional deep learning models using discrete target labels for modulating the learning process, we use probability distributions that allow the model to estimate the location of interest, i.e. the region in the image that triggers selective attention. Interestingly, after using teaching signals with only one emotional expression in the image, experiments have shown that the model is able to produce congruent probability distributions for more than one expression present in the scene. We evaluated our system with a bi-modal face and body benchmark dataset, showing that the combination of facial properties and body movements significantly improves the detection of emotion-relevant areas in the image.

2. Deep emotional attention model

Our model combines the idea of hierarchical learning and selective emotional attention using convolutional neural networks (CNNs). Our approach differs from traditional CNN-based approaches in two respects: first, the input stimuli are composed of the whole scene, which may or may not contain people expressing emotions. Second, the network is trained to (a) localize where the emotion expression is and (b) identify if the detected emotion expression is interesting enough to attract the attention of the model.

CNNs have been used for several visual recognition tasks, starting with the Neocognitron proposed by Fukushima [15]. In most cases the CNNs were used to learn hierarchical descriptors from the input stimuli and describe the input data in a smaller, but highly abstract representation. In our work, we do not use the CNN as a classification technique. We employ the convolutional units as feature descriptors, but instead of learning hierarchical contours and shapes, as is common in general image recognition tasks, they learn spatial information as discussed by Speck et al. [29].

We use our model to detect emotional events conveyed by face expressions and body movements. In this scenario, each convolutional unit learns how to process facial and movement features from the whole image. Our work differs from simple classification tasks in that the convolutional units are not tuned to describe forms, but rather to identify where an expression is located in the image. Therefore, we implement a hierarchical localization representation where each layer deals with a sub-region of the image. The first layers learn how to detect Regions of Interest (ROI), which are then fine-tuned in the deeper layers. Because the pooling units increase spatial invariance, we only apply them in our last layers, which means that our first layers are composed only of convolutional units stacked together. In the remainder of this section we describe our model, starting with common CNNs and then our emotional attention model. To train our network as a localization model, we use a different learning strategy based on probability density functions, which is also explained in this section.

2.1. Convolutional neural networks

Each layer of the CNN has a set of different convolutional units that increase the capability to learn different features from the same region in the image. This operation generates different filtered outputs, or filter maps, one for each unit. The pooling units in each of these filter maps generate spatial invariance. Each set of filters acts on a receptive field in the input stimuli. The activation of each unit v^{xy}_{nc} at position (x, y) of the nth filter in the cth layer is given by

v^{xy}_{nc} = \max\left( b_{nc} + \sum_{m} \sum_{h=1}^{H} \sum_{w=1}^{W} w^{hw}_{(c-1)m} \, v^{(x+h)(y+w)}_{(c-1)m},\; 0 \right), \quad (1)

where max(·, 0) represents the rectified linear function, shown to be effective for training deep neural architectures [16], b_{nc} is the bias for the nth feature map of the cth layer, m indexes over the set of feature maps in layer (c−1) connected to the current layer c, and w^{hw}_{(c-1)m} is the weight of the connection between the unit (h, w) within a receptive field, connected to the previous layer c−1 and to the filter map m. H and W are the height and width of the receptive field.

In the pooling layers, a receptive field of the previous filter map is connected to a pooling unit in the current layer, reducing the dimensionality of the feature maps. The pooling units generate the maximum activation of the receptive field u(x, y), which is defined as:

v_j = \max_{n \times n}\left( v_{nc} \, u(x, y) \right), \quad (2)

where v_{nc} is the output of the convolutional unit. In this function, the pooling unit computes the maximum activation over the receptive field u(x, y). The maximum operation down-samples the feature map while maintaining the input structure.
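
To make these operations concrete, the following NumPy sketch implements Eq. (1) for a single filter and the max-pooling of Eq. (2). The array shapes and function names are illustrative and are not taken from the authors' implementation.

# A NumPy sketch of Eqs. (1) and (2); names and shapes are illustrative.
import numpy as np

def conv_relu(feature_maps, weights, bias):
    # feature_maps: (M, Y, X) maps of layer c-1; weights: (M, H, W) for one
    # filter n of layer c; bias: scalar b_nc. Returns one filter map (Eq. 1).
    M, H, W = weights.shape
    _, Y, X = feature_maps.shape
    out = np.zeros((Y - H + 1, X - W + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = feature_maps[:, y:y + H, x:x + W]
            out[y, x] = max(bias + np.sum(weights * patch), 0.0)  # rectifier
    return out

def max_pool(filter_map, n=2):
    # Non-overlapping n x n max-pooling (Eq. 2).
    Y, X = (filter_map.shape[0] // n) * n, (filter_map.shape[1] // n) * n
    blocks = filter_map[:Y, :X].reshape(Y // n, n, X // n, n)
    return blocks.max(axis=(1, 3))

# Example: three 16 x 16 input maps, one 3 x 3 filter, then 2 x 2 pooling.
rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 16, 16))
w = 0.1 * rng.standard_normal((3, 3, 3))
print(max_pool(conv_relu(maps, w, bias=0.0), n=2).shape)  # (7, 7)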

2.2. Sequence processing

In a traditional CNN, convolutional units are connected to a single input stimulus, generating a representation of that single input. As we are dealing with sequential data, our input is a series of stimuli which may affect the representation of each other. Therefore, we use cubic receptive fields, implemented in the convolutional units by stacking filters together to be applied to the same region of different input stimuli [21]. This makes it possible to create a representation of the whole sequence, which will be tuned to adapt to the presence of patterns in the sequence as a whole and not in individual stimuli. With z indexing the image stack, the value of each unit (x, y, z) at the nth filter map in the cth layer becomes:

v^{xyz}_{nc} = \max\left( b_{nc} + \sum_{m} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{r=1}^{R} w^{hwr}_{(c-1)m} \, v^{(x+h)(y+w)(z+r)}_{(c-1)m},\; 0 \right), \quad (3)

where w^{hwr}_{(c-1)m} is the weight of the connection between the unit (h, w, r) within a receptive field connected to the previous layer (c−1) and the filter map m, and R is the number of images stacked together, representing the new dimension of the receptive field.

In our model, one unit is connected directly to the same region in a sequence of input stimuli, but with different weights per element in the sequence. This gives the model the capacity to identify pattern variations in the same regions of the image, detecting shapes or forms that appear in the same region across frames. This way, the convolutional units can identify if an object is moving by the change of the response of the neurons in neighboring regions. By tuning these filters to learn a specific shape, it is possible to identify objects or, in our case, expressions through time. A similar process was discussed by Wallis et al. [34] and Wiskott and Sejnowski [36].
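
The sketch below extends the previous one to the cubic receptive field of Eq. (3), applying one filter with depth R to the same spatial region across a stack of frames; again, the shapes and names are illustrative assumptions rather than the authors' code.

# A NumPy sketch of the cubic receptive field of Eq. (3); shapes are illustrative.
import numpy as np

def cubic_conv_relu(frame_stack, weights, bias):
    # frame_stack: (M, Z, Y, X) -- M maps of layer c-1 over Z stacked frames.
    # weights: (M, R, H, W) -- one filter with separate weights per frame offset r.
    M, R, H, W = weights.shape
    _, Z, Y, X = frame_stack.shape
    out = np.zeros((Z - R + 1, Y - H + 1, X - W + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = frame_stack[:, z:z + R, y:y + H, x:x + W]
                out[z, y, x] = max(bias + np.sum(weights * patch), 0.0)
    return out

rng = np.random.default_rng(1)
stack = rng.standard_normal((1, 10, 20, 20))      # one map, 10 stacked frames
w = 0.05 * rng.standard_normal((1, 3, 5, 5))      # depth R = 3, spatial 5 x 5
print(cubic_conv_relu(stack, w, bias=0.0).shape)  # (8, 16, 16)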

2.3. Shunting inhibition

Typically, CNN-based approaches comprise several layers stacked together and many filters, thus leading to a large number of parameters to be trained. This makes the network hard to train and requires a huge amount of data so that all these filters learn meaningful information. To reduce the necessity of many layers, we use shunting inhibitory fields [14] in the last layers of our localization network. Shunting inhibitory neurons are neurophysiologically plausible mechanisms that are present in several visual and cognitive functions [17]. When applied to convolutional units, shunting neurons can result in filters which are more robust to geometric distortions, meaning that the filters learn more high-level features. Each shunting neuron S^{xy}_{nc} at the position (x, y) of the nth receptive field in the cth layer is activated as:

S^{xy}_{nc} = \frac{u^{xy}_{nc}}{a_{nc} + I^{xy}_{nc}}, \quad (4)

where u^{xy}_{nc} is the activation of the common unit in the same position and I^{xy}_{nc} is the activation of the inhibitory neuron. The weights of each inhibitory neuron are trained with backpropagation. The passive decay term a_{nc} is a trained parameter and is the same for the whole shunting inhibitory field.

We apply cubic full convolution in the first layer, which produces a 2D output because of the convolution padding, meaning that, starting from the second layer, the outputs are 2D vectors. This means that the shunting neurons are applied as a 2D filter map, without the z index.

The motivation for applying shunting neurons is to specify the filters of a layer. The shunting neurons act as an over-specification tool, which makes the filters in a layer learn complex patterns. When applied to low-level features such as edges and contours, the shunting neurons tend to destroy the generalization properties of deep neural networks. However, when applied to high-level features with a very abstract specification, the shunting neurons tend to filter noise and learn only the most relevant features. We apply the shunting neurons in the last layer of our network, thus assuring that only the highest abstraction level is highly specified.
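
A minimal sketch of Eq. (4) is shown below. The text specifies the division by the passive decay term plus the inhibitory activation; how the inhibitory activation I_nc is produced is not spelled out here, so treating it as a given map is an assumption of this illustration.

# A NumPy sketch of the shunting unit of Eq. (4).
import numpy as np

def shunting_layer(excitatory, inhibitory, passive_decay):
    # excitatory: filter map u_nc; inhibitory: map I_nc (assumed given here);
    # passive_decay: scalar a_nc shared by the whole shunting inhibitory field.
    return excitatory / (passive_decay + inhibitory)

rng = np.random.default_rng(2)
u = np.abs(rng.standard_normal((12, 12)))   # rectified activations are non-negative
i = np.abs(rng.standard_normal((12, 12)))
print(shunting_layer(u, i, passive_decay=1.0).shape)  # (12, 12)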

2.4. Emotional attention

Vuilleumier [32] described the processes of emotional attention in the human brain for developing an emotional salience model based on visual attention, in which different expressions introduce different reactions in specific neurons in the brain, thus suggesting that expressive emotions such as angry or happy expressions receive more attention than neutral faces. To achieve the localization processing, we build our architecture in a different way: we only apply the pooling operation in the last layers. Pooling usually introduces spatial invariance, which is in contrast with our concept of identifying a region of interest. Therefore, pooling is only applied in the last two convolutional layers, i.e. after we have identified a possible interest region by filtering information with the convolutional units. We apply shunting neurons in the last convolutional layer.

We observed that the use of the shunting neurons in the last layer caused an interesting effect in our network. As we feed the network with a full image and expect it to output a location, we are training the filters to detect unlabeled emotion expressions. Since the first two layers have no pooling, the filters are applied to the whole image to specify where possible expressions are occurring. By applying the shunting neurons on this layer, we increase the specification of the layer and assure that it can learn, indirectly, how to distinguish between different expressions.

CNNs are usually trained using strongly labeled classes, which provides an interpretation of the features learned by the filters. However, this also implicitly guides the learning process to shape the learned filters to detect features which are important for each class. For example, training the network to detect categorical expressions such as anger, happiness and sadness will enforce the filters to detect features which are important for such classes. To train our network we use a different strategy. In a localization scheme, we train our network to create salience regions on the input stimuli. This way, we do not have specific classes, such as the ones mentioned above. The model is trained using a probability distribution (instead of a class attribution) that indicates the location of interest.

Our model has two output layers, each one responsible for describing positions on the 2D visual field. One of the output layers provides the positions on the X-axis and the other on the Y-axis, which means that each output layer has a different dimension. Furthermore, using a probability distribution allows our model to converge faster, since the idea of having a precise position would be hard to learn. The shape of the distribution changes, e.g. depending on how close to the camera the expression is performed. Using a probability distribution as output also allows us to identify other interesting regions in the images, allowing concurrent events to happen at the same time. Fig. 1 illustrates an input sample to the network and the corresponding output distribution probabilities.

Fig. 1. Example of output of the teaching signal. For each sequence of images, representing a scene with one or more expressions happening, two probability distributions are used as teaching signal, describing the region of interest by its position on the x- and y-axis, respectively.
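
As an illustration of such a teaching signal, the sketch below builds one distribution over image columns and one over rows, centered on an expression and normalized with a softmax as described in Section 2.4.1. The Gaussian shape and its width are assumptions made only for this example; the paper specifies the target as a probability distribution but not its exact form.

# A sketch of a location teaching signal: a softmax-normalized bump per axis.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def teaching_signal(center, size, length):
    # center: expression position on the axis (pixels); size: rough extent of
    # the expression; length: number of output units (image width or height).
    axis = np.arange(length)
    bump = np.exp(-0.5 * ((axis - center) / max(size / 2.0, 1.0)) ** 2)
    return softmax(bump)

# 400 x 300 input image, expression centered at x = 120, y = 200, about 60 px wide.
target_x = teaching_signal(center=120, size=60, length=400)
target_y = teaching_signal(center=200, size=60, length=300)
print(target_x.argmax(), target_y.argmax())  # 120 200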

2.4.1. Attention architecture

Our model is illustrated in Fig. 2. We use as input a sequence of 10 images, each one with a size of 400 × 300 pixels. That means that, with a frame rate of 30 frames per second, our model localizes expressions within approximately 300 ms, which lies within the duration of a common expression, between 200 ms and 1 s [11]. It is important for us to deal with sequences, as static images alone do not carry much of the spontaneous characteristics of expressions [28].

Our model implements 4 convolution layers, the first three with 4 filters and dimensions of 9 × 9 in the first layer and 5 × 5 in the second and third. A fourth convolution layer with 8 filters and a dimension of 3 × 3 is preceded and followed by pooling, with receptive field sizes of 8 × 8 and 2 × 2, respectively. These sizes were found to perform best based on empirical evaluations. The last convolutional layer, labeled Conv4 in Fig. 2, implements shunting neurons. Here we use a pyramidal structure where the first layers apply a few filters with large receptive fields, and the last layers have more filters with smaller receptive fields. This is necessary because the role of the first layers is to detect, and thus filter, possible interest regions in the whole image. The larger receptive fields help filter large chunks of the image, where noise is easily filtered out. The last layer applies a smaller filter, which looks for very specific information (face expressions and body movements).

The last pooling layers are connected to two separate fully connected hidden layers, which are then connected to two softmax output layers. We also normalize the teaching signal with a softmax function before using it for the network training. By separating the fully connected hidden layers, we make sure that the same representation will be used for both outputs, but with independent connections. Furthermore, the network is trained with an error calculated from both fully connected hidden layers, meaning that each output influences the training of the convolution layers. Each fully connected hidden layer has 200 neurons and the output layers have 300 and 400 output units, one for each row and column of pixels in the input image. The model applies the L2 norm and dropout as regularization methods, using momentum during training with an adaptive learning rate. The input images are pre-processed with ZCA-whitening [8] before being used as input to the network.

Fig. 2. Proposed attention model. Our architecture has four convolution layers and only the two last ones are followed by a pooling layer. In this figure, each layer is illustrated with its filters, and the arrows illustrate how the layers connect to each other. The convolution layers are fully connected with two separated fully connected hidden layers, which are connected with output units, computing the x and y location of the possible expression.

2.5. Network representation

Our model deals with multimodal emotional attention in terms of expressions composed of face expressions and body movement. The input of our network is a sequence of images from which it will detect ROI representing possible emotion expressions. It is challenging to identify what the filters of a CNN actually learn. As in any neural network, the knowledge of the model is stored in its weights and, in this case, in the filters. Every filter in the convolution layers learns to detect certain patterns in the input stimuli and, because of the pooling operation, the deeper layers learn patterns which represent a far larger region of the input. This means that the observation of the filters does not provide a reliable way to evaluate the knowledge of the network. To be able to identify what the network learned and how it distinguishes between different visual cues, we deploy the deconvolution process [37].

In the deconvolution process for visualizing the activation of a neuron a in layer l (a_l), an input is fed to the network and the signal is forwarded. Afterward, the activation of every neuron in layer l, except for a, is set to zero. After that, each convolution and pooling operation of each layer is reversed. The reverse of the convolution, named unfiltering, is done by reversing the connections and flipping the filters horizontally and vertically. The unfiltering process can be defined as

f = \sum_{m=1}^{M} \sum_{h=1}^{H} \sum_{w=1}^{W} wf^{hw}_{(c-1)m} \, u^{(x+h)(y+w)}_{(c-1)m}, \quad (5)

where wf^{hw}_{(c-1)m} represents the flipped filters.

The reverse of the max-pooling operation is defined as unpooling. Zeiler and Fergus [37] showed that it is necessary to consider the position of the maximum values in order to improve the quality of the visualizations, so these values are stored during the forward pass. For unpooling, non-max values are set to zero. Backpropagating the activation of a neuron causes the reconstruction of the input in a way that only the region which activates that neuron is visualized. Our CNNs use rectified linear units, meaning that the neurons which are activated output positive values, with zero representing no activation. In our reconstructed inputs, bright pixels indicate the importance of that specific region for the activation of the neuron.

The deconvolution process helps us understand how the network deals with the input stimuli. As we are dealing with image sequences, which contain – among other things – emotional expressions, the network tends to learn very different features. The teaching signal is composed of possible ROI, thus training the network to implicitly detect emotion expressions. As we do not use labels to identify these expressions, the representation of different expressions and the different cues emerge among the network filters. We use the visualizations to illustrate how our network learns multi-cue information of unlabeled expressions.
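
The sketch below illustrates the two reversed operations with NumPy: max-pooling that records the positions of the maxima (the "switches"), unpooling that writes each value back to its recorded position while non-max entries stay zero, and unfiltering as in Eq. (5) with a horizontally and vertically flipped filter. It is a simplified single-map illustration, not the visualization code used in the paper.

# A NumPy sketch of unpooling with stored switches and of unfiltering (Eq. 5).
import numpy as np

def max_pool_with_switches(fmap, n=2):
    Y, X = (fmap.shape[0] // n) * n, (fmap.shape[1] // n) * n
    pooled = np.zeros((Y // n, X // n))
    switches = np.zeros_like(pooled, dtype=int)
    for i in range(Y // n):
        for j in range(X // n):
            block = fmap[i * n:(i + 1) * n, j * n:(j + 1) * n]
            switches[i, j] = int(block.argmax())   # position of the maximum
            pooled[i, j] = block.max()
    return pooled, switches

def unpool(pooled, switches, n=2):
    out = np.zeros((pooled.shape[0] * n, pooled.shape[1] * n))
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            di, dj = divmod(switches[i, j], n)
            out[i * n + di, j * n + dj] = pooled[i, j]  # non-max entries stay zero
    return out

def unfilter(fmap, filt):
    # Full correlation with the horizontally and vertically flipped filter.
    flipped = filt[::-1, ::-1]
    H, W = flipped.shape
    padded = np.pad(fmap, ((H - 1, H - 1), (W - 1, W - 1)))
    out = np.zeros((fmap.shape[0] + H - 1, fmap.shape[1] + W - 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(flipped * padded[y:y + H, x:x + W])
    return out

rng = np.random.default_rng(3)
fmap = np.abs(rng.standard_normal((8, 8)))
pooled, switches = max_pool_with_switches(fmap)
print(unfilter(unpool(pooled, switches), rng.standard_normal((3, 3))).shape)  # (10, 10)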

3. Perception model

As our perception model, we use the previously presented Cross-channel Convolutional Neural Network (CCCNN). This model introduces the use of multichannel and cross-channel learning for emotion expression recognition and presented competitive performance and good generalization capabilities.

The CCCNN can deal with multimodal stimuli, and we deploy it for visual expression representation, primarily face expressions and body movements. The CCCNN extracts hierarchical features from the two modalities and generates complex representations which vary depending on the presented stimuli. This means that the deeper layers will have a full representation of the input, while the first layers will have a local representation of some regions or parts of the input stimuli.

To be able to deal with sequences, cubic receptive fields are implemented, expanding the capability of the model into modeling dependencies between sequential information. Also, the use of shunting inhibitory fields allows us to ensure a strong feature representation.

The proposed model is able to learn simple and complex features and to model the dependencies of these features in a sequence. Using a multichannel implementation, it is possible to learn different features for each stimulus. We use the concept of cross-channel learning to deal with differences within the modalities. This allows us to have regions of the network specified to learn features from face expressions and body movements, while the final representation integrates both specific features.

In our CCCNN implementation, we use a Face and a Movement channel. The Face channel is composed of two convolution and pooling layers. The first convolution layer implements 5 filters with cubic receptive fields, each one with a dimension of 5 × 5 × 3. The second layer implements 5 filter maps, also with a dimension of 5 × 5, and a shunting inhibitory field. Both layers implement max-pooling operators with a receptive field of 2 × 2. In our experiments, we use a rate of 30 frames per second, which means that the 300 ms are represented by 9 frames. Each frame is resized to 50 × 50 pixels.

The Movement channel implements three convolution and pooling layers. The first convolution layer implements 5 filters with cubic receptive fields, each one with a dimension of 5 × 5 × 3. The second and third layers implement 5 filters, each one with a dimension of 5 × 5, and all layers implement max-pooling with a receptive field of 2 × 2. We feed this channel with 1 s of expressions, meaning that we feed the network with 30 frames. We compute a motion representation for every 10 frames, meaning that we feed the Movement channel with 3 motion representations. All the images are resized to 128 × 96 pixels.

We apply a Cross-channel to the Face and Movement channels. This Cross-channel receives the outputs of the Face and Movement channels as input, and it is composed of one convolution layer with 10 filters, each one with a dimension of 3 × 3, and one max-pooling layer with a receptive field of 2 × 2. We have to ensure that the inputs of the Cross-channel have the same dimension; to do so, we resize the output representation of the Movement channel to 9 × 9, the same as the Face channel. Fig. 3 illustrates the implemented CCCNN.

Fig. 3. Implemented CCCNN. This network has two channels, one learning features from the face expression only, and the other from the whole body movement. At the end, the network implements a Cross-channel which integrates both representations into one emotional concept.
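
The sketch below shows how the Cross-channel input could be assembled from the two channel outputs: the Movement-channel maps are resized to 9 × 9 to match the Face-channel maps and both sets are stacked into one input volume. The nearest-neighbour resizing and the example Movement-map size are assumptions for illustration; the paper only states that the Movement output is resized to the Face output's dimension.

# A sketch of the Cross-channel input: resize Movement maps to 9 x 9 and stack.
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    rows = np.arange(out_h) * fmap.shape[0] // out_h
    cols = np.arange(out_w) * fmap.shape[1] // out_w
    return fmap[np.ix_(rows, cols)]

face_maps = np.random.rand(5, 9, 9)        # Face-channel output: 5 maps of 9 x 9
movement_maps = np.random.rand(5, 14, 10)  # Movement-channel output (example size)
movement_resized = np.stack([resize_nearest(m, 9, 9) for m in movement_maps])
cross_channel_input = np.concatenate([face_maps, movement_resized], axis=0)
print(cross_channel_input.shape)  # (10, 9, 9)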

4. Attention-modulated perception

Our model was trained with a full image, without the necessity of any kind of segmentation. This means that the whole scene, including the facial expression and body movement, is processed by the convolutional filters. After analyzing the filters learned by the network, we could see how the network trained specific filters for specific visual stimuli: a set of filters were reacting to facial expressions and another set to body movements.

By visualizing the last layer of the network, using the deconvolution process, we are able to illustrate what the network has learned. Fig. 4 illustrates the visualization of different filters of the pre-last layer of the network when one input stimulus was presented. It is possible to see that the network has learned how to detect different modalities. Filters 1–3 learned how to filter features of the face, such as the eyes, mouth and face shape.

Filters 5–8, on the other hand, learned how to code movement. Furthermore, we can see that the network highlighted the hands, elbows, and hand movements in the representation, while other shapes were ignored. Another aspect of CNNs could be observed: filters 5–8 detected partially the same movement shape, but with different activations, represented by the gray tone in the image. CNNs are known to use redundant filters in their classification process, and the movements were depicted by more filters than face expressions, mainly because the movement across the frames is the most salient feature. Filter 4, which did not pick up any meaningful information, did not contribute to the localization of this expression.

Fig. 4. Examples of what each filter in the last layer of the network reacts to when presenting the stimuli shown on the left. We can see that filters 1–3 focus on face features, filter 4 does not pick up any meaningful information from this expression, and filters 5–8 focus on movement. In this figure, filters 1–4 are inverted for better visualization.

The two-stage hypothesis of emotional attention [5] states that attention stimuli are first processed as a fast feed-forward signal by the amygdala complex, and then used as feedback for the visual cortex. This theory states that there are many feedback connections between the visual cortex and the amygdala; however, to simplify our attention mechanism, we use only one-sided modulation: from the attention model to our perception model.

Based on this theory, our attention modulation starts as feed-forward processing in the attention model, which then provides the detected region as input for the perception model. Finally, we make use of specific features of the attention model as a modulator for the perception model. An image is fed to our attention CNN, and an interest region is obtained. However, as shown in the previous section, our attention model has specific features which detect face expressions and body movements. To integrate both models, we create a connection between these filters and the second convolutional layer of the CCCNN. That means that our attention model feeds specific facial and movement features to the second layer of the CCCNN. The second layer was chosen because, in the face channel, it already extracts features which are similar to the ones coming from the attention model, closely related to final facial features. The movement channel still needs one more convolution layer to learn specific movement features. Fig. 5 illustrates our final attention modulation model.

We selected the face-related features and added them to the input of the second convolution layer of the face channel. This creates an extra set of inputs, which are correlated and processed by the convolution layer. By adding a new set of features to the input of the convolution layer, we are effectively biasing the representation towards what was depicted by the attention model.

The new attention modulation model should be trained in parallel, with one teaching signal for each set of convolution filters. The final teaching signal is composed of the location of the face, to update the weights of the attention-related convolutions, and an emotion concept, to update the weights of the representation-related convolutions. While training the CCCNN layers, we also update the specific facial and movement filters of the attention model in the fourth convolution layer. This is done in order to integrate our facial and movement representations and to assure that the attention modulation actually has a meaning for the learning of both models. We can also see this as a feedback connection, which ensures that the models can learn with each other.

Fig. 5. Our attention modulation model, which uses the specific features learned by the attention architecture as inputs for the specific channels of deeper layers of the CCCNN. In this picture, the red dotted line represents the specific facial feature maps and the black dotted line the movement feature maps.
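
A minimal sketch of the modulation step is given below: the face-related filter maps of the attention model's fourth convolution layer are appended as extra inputs to the second convolution layer of the Face channel, biasing its representation towards the attended features. The map counts, sizes, and the assumption that both sets already share a spatial resolution are purely illustrative.

# A sketch of the modulation: attention face maps become extra Face-channel inputs.
import numpy as np

face_layer1_out = np.random.rand(5, 23, 23)      # Face channel after conv1 + pooling
attention_face_maps = np.random.rand(3, 23, 23)  # attention Conv4 maps reacting to faces

# Extra, correlated inputs for the second convolution layer of the Face channel,
# biasing its representation towards the region attended by the attention model.
face_layer2_input = np.concatenate([face_layer1_out, attention_face_maps], axis=0)
print(face_layer2_input.shape)  # (8, 23, 23)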

5. Experimental results

To evaluate our model, we use a meeting environment where different emotional expressions can be displayed. For the evaluation, we use three different types of stimuli: only the face, only the body movement, and the face combined with the body movement. To make sure that our model robustly accounts for selective attention, we evaluate the model in three different scenarios: with no expression, with one expression, and with two expressions being displayed. This way, we can observe the response of the model for different combinations of emotional expressions in the scene.

To evaluate our model with spontaneous expressions, we used the bi-modal face and body benchmark database (FABO) corpus introduced by Gunes and Piccardi [19]. This corpus is composed of recordings of the upper torso of different subjects while performing emotion expressions. It contains a total of 10 expressions performed by 23 subjects of different nationalities. Each expression is performed in a spontaneous way, where no indication was given of how the subject must perform the expression. A total of 284 videos were recorded, each one having 2–4 of the following expressions: “Anger”, “Anxiety”, “Boredom”, “Disgust”, “Fear”, “Happiness”, “Surprise”, “Puzzlement”, “Sadness” and “Uncertainty”. The total number of videos per expression is shown in Table 1. Fig. 6 illustrates images present in a sequence of an angry expression in the FABO corpus.

Table 1. Number of videos available for each emotional state in the FABO dataset. Each video has 2–4 executions of the same expression.

Emotional state   Videos    Emotional state   Videos
Anger             60        Happiness         28
Anxiety           21        Puzzlement        46
Boredom           28        Sadness           16
Disgust           27        Surprise          13
Fear              22        Uncertainty       23

Fig. 6. Examples of images in an angry expression in the FABO corpus.

5.1. Emotional attention dataset

To build our attention dataset, we extracted the face expression and body movement from the FABO corpus and placed them at a random position in our meeting scene background. The expressions always had the same size, while the positions were randomly selected. We created a dataset with 1200 expressions composed of randomly chosen sequences from the FABO corpus using the happy and neutral expressions only. That means that the same expression could be selected more than once, but displayed at a different position in our meeting scenario.

We created three different categories of sequences: one with only the background and no expressions, one with a single expression (either neutral or happy), and one with both a neutral and a happy expression. We use the neutral phase of each expression to create a neutral sequence. A total of 400 sequences per category were created. Fig. 7 illustrates one frame of a sequence of each category.

Fig. 7. Examples of images with no expressions, one expression (happy) and two expressions (happy and neutral), used to train and evaluate our model.

5.2. Experiments

Our model was trained in an emotion attention scenario. That means that if no emotion is present, the output of the model should be no ROI at all; thus the probability distribution for the “no emotion” condition is always 0 for the whole output vector. If one expression is present, we want the model to trigger higher attention to displayed expressions, meaning that happy expressions should have a higher probability distribution with respect to neutral ones. The “happy” expressions in the FABO corpus usually involve facial movement, usually smiles, and arm movement, while most of the neutral expressions are related to very little movement and almost no alteration of a resting face expression. For this purpose, when a neutral expression was present, we penalized the distribution by dividing the probabilities by 2. This still produced an attention factor for the neutral expression, but with a smaller intensity than the one present in the happy expression. Finally, when both expressions were present, the probability distribution was split into 1/3 for the neutral expression and 2/3 for the happy expression.
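
The sketch below illustrates these target adjustments: the distribution of a lone neutral expression is halved, and when a neutral and a happy expression appear together the target mass is split 1/3 versus 2/3. Building the per-expression bumps as normalized Gaussians is an assumption made only for this example.

# A sketch of the target adjustments for neutral and happy expressions.
import numpy as np

def bump(center, width, length):
    axis = np.arange(length)
    g = np.exp(-0.5 * ((axis - center) / width) ** 2)
    return g / g.sum()

length = 400                                   # output units along the x-axis
no_expression = np.zeros(length)               # "no emotion": all zeros
neutral_only = bump(120, 20, length) / 2.0     # lone neutral expression: halved
both = bump(120, 20, length) / 3.0 + 2.0 * bump(300, 20, length) / 3.0
print(no_expression.sum(), round(neutral_only.sum(), 2), round(both.sum(), 2))  # 0.0 0.5 1.0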

In our attention experiments, we perform three different setups: one with only the face expression, one with only the body motion, and one with both. For the face expression experiments, we pre-processed the data by cropping the face out of the expressions. For the body movement experiments, we removed the face from the expression, and we kept the whole expression for the experiments with both stimuli. We performed each experiment 10 times, with a random selection of 70% of each category's data for training and 30% for testing. That means that a total of 280 images per category were used for training the model and 120 for testing.

We used a top-20 accuracy rate measurement. This rate describes whether the top-20 activations in one feed-forward step match the top-20 values of the teaching signal. We counted how many of the top-20 activations match and then averaged the result. The value of 20 activations translates to 20 pixels of precision, which was the mean size of the expression images. To help us understand how the model learns to detect expressions, we applied a deconvolution process to visualize the features of each expression and analyzed the output of the network for different inputs.

Finally, for our modulation experiment we use the attention model as the modulator for our CCCNN. We evaluate the network on an emotion recognition task using the 10 classes of the FABO dataset and split the data into 30% for testing and 70% for training. We perform the experiments 30 times, each time randomly choosing the samples for training and testing, and calculate the mean weighted accuracy and standard deviation. The weighted accuracy allows us to take into consideration the different number of sequences per class in the FABO dataset, calculating the accuracy in relation to the number of samples of each class. This is a standard evaluation measure used in other works on the same dataset, and we followed the same guidelines as in [7].
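
The sketch below spells out our reading of the two measures: the top-20 accuracy as the overlap between the 20 highest output activations and the 20 highest teaching-signal values, and the weighted accuracy as the mean of the per-class accuracies. It is an illustration of the definitions above, not the authors' evaluation script.

# A sketch of the top-20 accuracy and the weighted (per-class) accuracy.
import numpy as np

def top20_accuracy(output, target, k=20):
    # Fraction of the k highest output positions that are also among the
    # k highest positions of the teaching signal.
    top_out = set(np.argsort(output)[-k:])
    top_tgt = set(np.argsort(target)[-k:])
    return len(top_out & top_tgt) / k

def weighted_accuracy(y_true, y_pred):
    # Mean of the per-class accuracies, so each class counts equally
    # regardless of how many samples it has.
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

rng = np.random.default_rng(4)
output, target = rng.random(400), rng.random(400)
print(top20_accuracy(output, target))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(round(weighted_accuracy(y_true, y_pred), 3))  # 0.889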

Table 2. Reported top-20 accuracy, in percentage, and standard deviation (in parentheses) for different modalities and numbers of expressions presented to the model.

No expression: 75.5% (3.2)

                   Face          Body movement   Both
One expression     87.7% (1.5)   85.0% (2.3)     93.4% (3.3)
Two expressions    76.4% (2.7)   68.2% (1.4)     84.4% (3.4)

Table 3. Reported accuracy, in percentage, and standard deviation (in parentheses) for the visual stream channels of the CCCNN trained with the FABO corpus, with and without attention modulation.

Class         Without attention   With attention
Anger         95.9 (1.3)          95.7 (1.1)
Anxiety       91.2 (1.6)          95.4 (0.7)
Uncertainty   86.4 (1.1)          92.1 (1.3)
Boredom       92.3 (1.7)          90.3 (0.8)
Disgust       93.2 (1.8)          93.3 (1.6)
Fear          94.7 (0.6)          94.5 (0.7)
Happiness     98.8 (0.2)          98.0 (0.3)
Surprise      94.6 (0.6)          97.2 (0.2)
Puzzlement    88.7 (1.2)          93.2 (0.8)
Sadness       99.8 (0.1)          99.5 (0.3)
Mean          93.65 (1.0)         95.13 (0.7)

Table 4. Comparison of the accuracies, in percentage, of our model with state-of-the-art approaches reported on the FABO corpus.

Approach          Accuracy (%)
[2]               91.3
[7]               75.0
[20]              82.5
CCCNN             93.65
Attention CCCNN   95.13

5.3. Results

Our results for the attention model are shown in Table 2. When no expressions were present, the top-20 accuracy of the model was 75.5%. In this condition, the model should output a distribution filled with 0s, indicating that no expression was presented.

After evaluating the model with one expression, we can see how the system performs under each modality condition. If only the face was present, the top-20 accuracy was 87.7%. In this case, the face expression was effective in identifying a point of interest. In this scenario, the use of happy or neutral face expressions led to strong hypotheses of interest regions. When evaluating only body movement, we see a slight drop in the top-20 accuracy rate, which became 85.0%. This was probably caused by the presence of the movements in the image without identification or clarification of which movement was performed (either happy or neutral). Finally, we can see that the best results were reached by the combination of both cues, reaching a top-20 accuracy of 93.4% and showing that the integration of face expression and body movement was more accurate than processing the modalities alone.

Presenting two expressions of two persons at the same time to the network caused a drop in the attention precision while making clearer the relevance of using both modalities. Using only face expressions reached a top-20 accuracy of 76.4%, while using only body movement reached 68.2%. In this scenario, the identification of the type of expression became important. The distinction between a happy and a neutral expression was crucial, and the face expression became more dominant in this case. However, when presenting both modalities at the same time, the model obtained 84.4% top-20 accuracy. This shows that both modalities together performed better in distinguishing between happy and neutral expressions than one modality alone.

The second round of experiments deals with the use of the attention modulation for emotion recognition. Table 3 shows the results obtained while training the attention modulation recognition model with the FABO corpus, in comparison with the common CCCNN architecture. It is possible to see that the mean general recognition rate increased from 93.65% to 95.13% with the use of the attention modulation. Although some of the expressions presented higher recognition, expressions such as “Boredom”, “Fear” and “Happiness” presented slightly smaller accuracy. This is probably due to hand-over-face movements or very slight movements, which were captured by the CCCNN but ruled out by the attention mechanism. Training the attention mechanism with more data, even with more different subjects, would give the model more capacity to achieve even higher accuracies.

Comparing our model to state-of-the-art approaches using the FABO corpus shows that our network performed similarly, and better in the face representation. Table 4 shows this comparison. The works of Chen et al. [7] and Gunes and Piccardi [20] extract several landmark features from the face and diverse movement descriptors for the body movement. They create a large feature descriptor for each modality and use techniques such as SVM and Random Forest, respectively, for classification. It is possible to see that the fusion of both modalities improved their results, but the performance is still lower than ours. In previous work, we used a Multichannel Convolution Neural Network (MCCNN) [2] to extract facial and movement features. That network produces a joint representation, but our current CCCNN improved this representation through the use of separated channels per modality and the application of inhibitory fields. One can see a substantial improvement in the movement representation, mostly because we use a different movement representation in the CCCNN.

6. Discussions

Emotional processes are not only influenced by sensory perception but also elicit adaptive responses and modulate perception. The use of emotional attention is a method to create an adaptive attentional mechanism that effectively mitigates the limitations introduced by a limited processing capacity. The human brain has this capacity to guide sensory processing through competing choices. Driver [9] suggested that this guiding process occurs during perceptual processing, i.e., before the actual sensory processing starts. Selective attention mechanisms are influenced by several processes, including emotional ones [10,23,35].

Our model uses convolutional neural networks and, as such, needs a large amount of data to train. Our attention network has many connections to train, and thus demands more computational effort than the CCCNN. However, as with many deep learning models, once the model is trained and able to generalize well, it does not demand a high computational cost for recognizing expressions. Using a Core i5 computer with an Nvidia Quadro K620 graphics card, one training epoch of our CCCNN with attention took around 4.8 minutes, while one forward pass took 0.4 s. Table 5 shows a comparison between the one-epoch training time and one forward pass of the CCCNN with and without the attention module, both on the same machine.

Table 5. Comparison of the time used to train the model (per epoch) and for one single recognition, for the recognition model with and without the attention module.

Approach           1 train epoch (min)   Forward pass (s)
CCCNN              1                     0.3
Attention CCCNN    4.8                   0.4

Page 9: Emotion-modulated attention improves expression ...€¦ · Our model consists of a deep architecture which implements convolutional neural networks to learn the location of emotional

112 P. Barros et al. / Neurocomputing 253 (2017) 104–114

Table 5

Comparison of the time used to train the model (per epoch),

and one single recognition between the attention model and the

recognition model.

Approach 1 train epoch (min) Forward pass (s)

CCCNN 1 0.3

Attention CCCNN 4.8 0.4

Fig. 8. Example of the output of the model, when one happy expression was pre-

sented. The input image values were multiplied by the network output values along

the corresponding axes resulting in the illustrated region of interest.

Fig. 9. Example of the output of the model, when two expressions, happy and neu-

tral, were presented.

Fig. 10. Example of the output of the model, when two happy expressions were

presented. The emotion which produced the higher activation peak gets primarily

highlighted in the region of interest.

w

t

n

p

b

f

o

f

e

w

s

f

Deep neural networks usually have many parameters to be selected and thus a large search space when tuning the model. We build on previous exploratory studies of the parameters of our convolutional neural networks for facial and movement expression recognition [2] to determine our proposed topology. We tune the parameters of our attention model using an empirical search, following the same experimental setup as in the experiments above. A more exhaustive parameter search could be performed and could achieve better results; however, the main purpose of this paper is to introduce the model and evaluate how effective it is for the proposed task.
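As an illustration only, the sketch below shows the kind of empirical grid search that can be used to pick the attention model's parameters; the parameter names, value ranges, and the train_and_evaluate callable are hypothetical and are not the settings reported in this paper.

from itertools import product

# Hypothetical search grid; the actual values were chosen empirically
# and are not those used in the paper.
GRID = {
    "learning_rate": [0.1, 0.01, 0.001],
    "filters_per_layer": [5, 10, 20],
    "filter_size": [3, 5, 7],
}

def empirical_search(train_and_evaluate, grid=GRID):
    # train_and_evaluate is a placeholder: it trains the attention network
    # with one configuration and returns its validation accuracy.
    best_config, best_accuracy = None, float("-inf")
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        accuracy = train_and_evaluate(**config)
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy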

We also evaluated the capability of the model to deal with different combinations of stimulus sources. In this section, we discuss these two aspects and analyze the output distribution of our attention model.

6.1. Emotional attention mechanism

One aspect of our architecture is how it behaves when different combinations of stimuli are presented. Our teaching signal determines the locations where expressions are displayed, but it also provides information about which expression to focus on. We showed this in a scenario with only one expression being displayed (Fig. 8). We can see how the model detected the expression on the x- and y-axes, and how the shape of the output distribution changes depending on the size of the area displaying an expression.
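The masking illustrated in Fig. 8 can be sketched as follows, assuming the attention network outputs two one-dimensional distributions, att_x over the image columns and att_y over the image rows; these names and the NumPy formulation are illustrative and do not reproduce our exact implementation.

import numpy as np

def region_of_interest(image, att_x, att_y):
    # image: (H, W) or (H, W, C) array; att_x: (W,); att_y: (H,).
    # The outer product combines the two 1D outputs into a 2D attention map.
    attention_map = np.outer(att_y, att_x)            # shape (H, W)
    if image.ndim == 3:                               # broadcast over channels
        attention_map = attention_map[..., np.newaxis]
    return image * attention_map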

In a second example, we used two expressions as input: one happy and one neutral expression. The output of the network changed in this scenario: instead of a clear bump, it produced a wider distribution with a smaller bump showing where the neutral expression was located. Fig. 9 illustrates this effect: the network correctly focuses on the happy expression. Lastly, we evaluated a scenario where two happy expressions were presented, with one being more expressive than the other. In this case, the network chose to focus on the most expressive one. Fig. 10 illustrates this process. On the predicted x-axis the network produces two clear bumps, with the bigger one for the most expressive expression. On the y-axis, the network produced a single larger bump, large enough to cover both expressions.

Although we do not provide experiments with more than two faces, we show that our model can focus on the most expressive one, which is our main goal. If we expanded the number of faces in the input without training the model to deal with it, we would expect the accuracy of our model to drop. As our network is trained with a probability distribution, it can detect faces of different sizes, as long as the training samples contain a variety of face sizes.
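As a minimal sketch of such a teaching signal, the code below builds a normalized Gaussian target distribution along one image axis, centered on an annotated expression; the sigma value and image dimensions are assumptions chosen for illustration rather than the settings used in our experiments.

import numpy as np

def teaching_signal(center, length, sigma):
    # Gaussian distribution over `length` positions along one image axis,
    # centered on the annotated expression and normalized to sum to one.
    positions = np.arange(length)
    signal = np.exp(-0.5 * ((positions - center) / sigma) ** 2)
    return signal / signal.sum()

# Example (illustrative values): a face centered at x=64, y=40 in a 128x96 image.
target_x = teaching_signal(center=64, length=128, sigma=10.0)
target_y = teaching_signal(center=40, length=96, sigma=10.0)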



7. Conclusion and future work

In this paper, we used a deep neural architecture to learn emotional attention. Our learning architecture is based on Convolutional Neural Networks, using the filtering capability of the convolution layers to learn the location of emotional expressions conveyed by two visual cues: facial expressions and body movements. Different from traditional CNN-based classification tasks, we trained our model with probability distributions that indicate the focus of attention. This resulted in a model that uses unlabeled expressions to identify attention regions and is able to distinguish between emotional and neutral expressions. We also show how this model can be used as an attention modulator for a Cross-channel Convolutional Neural Network (CCCNN), improving its recognition capabilities.

Our network was applied in a localization scenario, where expressions are shown and the network should select where to focus. When different expressions are shown, our network is able to focus its attention on the most expressive one. When two non-neutral expressions are present, the network is able to detect both of them, but still has an attention peak for only one of them. Experimental results showed that our model learns how to filter multi-cue information, with neurons that react to facial expressions and others that react to body movements emerging without being explicitly defined.

Our approach aims at selective attention modulated by affective behavior, with evidence suggesting that emotional salience influences the attention mechanism. Our network is built using a top-down mechanism which simulates selective attention in the human brain, where emotional expressions attract more attention than neutral expressions.

The obtained results motivate the extension of our current architecture in different directions. For instance, the integration of auditory information as a perceptual cue would give the model a more robust multimodal mechanism for emotional attention. We therefore expect that the addition of auditory information would increase the precision of the model and bring our computational approach closer to neurobiologically motivated neural mechanisms for multimodal integration and attention.

Acknowledgments

The authors thank Dr. Hatice Gunes for providing the FABO database. This work was partially supported by the CAPES Brazilian Federal Agency for the Support and Evaluation of Graduate Education (p.n. 5951–13–5), the DAAD German Academic Exchange Service for the Cognitive Assistive Systems project (Kz: A/13/94748), the German Research Foundation DFG under project Transregio Crossmodal Learning (TRR 169), and the Cross-modal Learning project under the Hamburg Landesforschungsförderung.

References

[1] R.C. Arkin, M. Fujita, T. Takagi, R. Hasegawa, An ethological and emotional basis for human–robot interaction, Robot. Auton. Syst. 42 (3) (2003) 191–201.
[2] P. Barros, D. Jirak, C. Weber, S. Wermter, Multimodal emotional state recognition using sequence-dependent deep hierarchical features, Neural Netw. 72 (2015) 140–151.
[3] P. Barros, C. Weber, S. Wermter, Emotional expression recognition with a cross-channel convolutional neural network for human–robot interaction, in: Proceedings of the Fifteenth IEEE-RAS International Conference on Humanoid Robots (Humanoids), IEEE, 2015, pp. 582–587.
[4] J. Bauer, S. Magg, S. Wermter, Attention modeled as information in learning multisensory integration, Neural Netw. 65 (2015) 44–52.
[5] J. Bullier, Integrated model of visual processing, Brain Res. Rev. 36 (2) (2001) 96–107.
[6] G. Castellano, L. Kessous, G. Caridakis, Emotion recognition through multiple modalities: face, body gesture, speech, in: C. Peter, R. Beale (Eds.), Affect and Emotion in Human–Computer Interaction, Lecture Notes in Computer Science, vol. 4868, Springer Berlin Heidelberg, 2008, pp. 92–103, doi:10.1007/978-3-540-85099-1_8.
[7] S. Chen, Y. Tian, Q. Liu, D.N. Metaxas, Recognizing expressions from face and body gesture by temporal normalized motion and appearance features, Image Vis. Comput. 31 (2) (2013) 175–185, doi:10.1016/j.imavis.2012.06.014.
[8] A. Coates, A.Y. Ng, Selecting receptive fields in deep networks, in: Proceedings of the 24th International Conference on Neural Information Processing Systems, Curran Associates Inc., 2011, pp. 2528–2536.
[9] J. Driver, A selective review of selective attention research from the past century, Br. J. Psychol. 92 (1) (2001) 53–78.
[10] J.D. Eastwood, D. Smilek, P.M. Merikle, Differential attentional guidance by unattended faces expressing positive and negative emotion, Percept. Psychophys. 63 (6) (2001) 1004–1013.
[11] P. Ekman, Lie catching and microexpressions, in: The Philosophy of Deception, Oxford University Press, 2009, pp. 118–133.
[12] P. Ekman, W.V. Friesen, Constants across cultures in the face and emotion, J. Personal. Soc. Psychol. 17 (2) (1971) 124–129.
[13] E. Fox, Processing emotional facial expressions: the role of anxiety and awareness, Cognit. Affect. Behav. Neurosci. 2 (1) (2002) 52–63, doi:10.3758/CABN.2.1.52.
[14] Y. Fregnac, C. Monier, F. Chavane, P. Baudot, L. Graham, Shunting inhibition, a silent step in visual cortical computation, J. Physiol. (2003) 441–451.
[15] K. Fukushima, Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern. 36 (4) (1980) 193–202, doi:10.1007/BF00344251.
[16] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: G.J. Gordon, D.B. Dunson (Eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), vol. 15, 2011, pp. 315–323. Journal of Machine Learning Research – Workshop and Conference Proceedings.
[17] S. Grossberg, Neural Networks and Natural Intelligence, MIT Press, Cambridge, MA, USA, 1992.
[18] Y. Gu, X. Mai, Y.-j. Luo, Do bodily expressions compete with facial expressions? Time course of integration of emotional signals from the face and the body, PLoS One 8 (7) (2013) 736–762.
[19] H. Gunes, M. Piccardi, A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior, in: Proceedings of the Eighteenth International Conference on Pattern Recognition (ICPR), vol. 1, 2006, pp. 1148–1153, doi:10.1109/ICPR.2006.39.
[20] H. Gunes, M. Piccardi, Automatic temporal segment detection and affect recognition from face and body display, IEEE Trans. Syst. Man Cybern. Part B Cybern. 39 (1) (2009) 64–84, doi:10.1109/TSMCB.2008.927269.
[21] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231, doi:10.1109/TPAMI.2012.59.
[22] Q. Jin, C. Li, S. Chen, H. Wu, Speech emotion recognition with acoustic and lexical features, in: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4749–4753, doi:10.1109/ICASSP.2015.7178872.
[23] S. Kastner, L.G. Ungerleider, Mechanisms of visual attention in the human cortex, Annu. Rev. Neurosci. 23 (1) (2000) 315–341.
[24] R.J. Krauzlis, L.P. Lovejoy, A. Zénon, Superior colliculus and visual spatial attention, Annu. Rev. Neurosci. 36 (1) (2013) 165–182, doi:10.1146/annurev-neuro-062012-170249.
[25] H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
[26] Y.-L. Lin, G. Wei, Speech emotion recognition based on HMM and SVM, in: Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, vol. 8, IEEE, 2005, pp. 4898–4901.
[27] E.A. Phelps, S. Ling, M. Carrasco, Emotion facilitates perception and potentiates the perceptual benefits of attention, Psychol. Sci. 17 (4) (2006) 292–299.
[28] J.A. Russell, Is there universal recognition of emotion from facial expressions? A review of the cross-cultural studies, Psychol. Bull. 115 (1) (1994) 102.
[29] D. Speck, P. Barros, C. Weber, S. Wermter, Ball localization for RoboCup soccer using convolutional neural networks, in: Proceedings of the 2016 RoboCup International Symposium (RoboCup'2016), 2016.
[30] M. Ursino, C. Cuppini, E. Magosso, Neurocomputational approaches to modelling multisensory integration in the brain: a review, Neural Netw. 60 (2014) 141–165.
[31] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[32] P. Vuilleumier, How brains beware: neural mechanisms of emotional attention, Trends Cognit. Sci. 9 (12) (2005) 585–594.
[33] P. Vuilleumier, S. Schwartz, Emotional facial expressions capture attention, Neurology 56 (2) (2001) 153–158, doi:10.1212/WNL.56.2.153.
[34] G. Wallis, E. Rolls, P. Földiák, Learning invariant responses to the natural transformations of objects, in: Proceedings of the 1993 International Joint Conference on Neural Networks, 1993, pp. 1087–1090.
[35] J.M.G. Williams, A. Mathews, C. MacLeod, The emotional Stroop task and psychopathology, Psychol. Bull. 120 (1) (1996) 3.
[36] L. Wiskott, T.J. Sejnowski, Slow feature analysis: unsupervised learning of invariances, Neural Comput. 14 (4) (2002) 715–770, doi:10.1162/089976602317318938.
[37] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the 2014 European Conference on Computer Vision (ECCV), Springer, 2014, pp. 818–833.



Pablo Barros received his Bachelor's degree in Information Systems from the Federal Rural University of Pernambuco and his Master's degree in Computer Engineering from the University of Pernambuco, both in Brazil. Since 2013, he has been a research associate and Ph.D. candidate in the Knowledge Technology Group at the University of Hamburg, Germany, where he is part of the research project CML (Cross-Modal Learning). His main research interests include deep learning and neurocognitive systems for multimodal emotional perception and learning, human–robot interaction and cross-modal neural architectures.

German I. Parisi received his Bachelor's and Master's degree in Computer Science from the University of Milano-Bicocca, Italy. Since 2013, he has been a research associate in the Knowledge Technology Group at the University of Hamburg, Germany, where he is part of the research project CASY (Cognitive Assistive Systems) and the international Ph.D. research training group CINACS (Cross-Modal Interaction in Natural and Artificial Cognitive Systems). His main research interests include neurocognitive systems for human–robot assistance, computational models of the visual cortex, and bio-inspired action and gesture recognition.

Cornelius Weber is Lab Manager at the Knowledge Technology Group at the University of Hamburg. He graduated in physics in Bielefeld, Germany, and received his Ph.D. in computer science at the Technische Universität Berlin in 2000. He was then a postdoctoral fellow in Brain and Cognitive Sciences at the University of Rochester, USA. From 2002 to 2005 he was a research scientist in Hybrid Intelligent Systems at the University of Sunderland, UK, and subsequently a Junior Fellow at the Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany, until 2010. His core interest is computational neuroscience with foci on vision, unsupervised learning and reinforcement learning.

Stefan Wermter received the Diploma from the University of Dortmund, the M.Sc. from the University of Massachusetts, and the Ph.D. and Habilitation from the University of Hamburg, all in Computer Science. He was a visiting research scientist at the International Computer Science Institute in Berkeley before leading the Chair in Intelligent Systems at the University of Sunderland, UK. Currently he is Full Professor in Computer Science at the University of Hamburg and Director of the Knowledge Technology institute. His main research interests are in the fields of neural networks, hybrid systems, cognitive neuroscience, bio-inspired computing, cognitive robotics and natural language processing. In 2014 he was general chair for the International Conference on Artificial Neural Networks (ICANN). He is also on the current board of the European Neural Network Society, associate editor of the journals Transactions on Neural Networks and Learning Systems, Connection Science, International Journal for Hybrid Intelligent Systems and Knowledge and Information Systems, and he is on the editorial board of the journals Cognitive Systems Research and Journal of Computational Intelligence.