
Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 251–259, Brussels, Belgium, October 31 – November 1, 2018. © 2018 Association for Computational Linguistics


Multi-modal Sequence Fusion via Recursive Attention for Emotion Recognition

Rory Beard¹,* Ritwik Das²,* Raymond W. M. Ng¹,* P. G. Keerthana Gopalakrishnan²

Luka Eerens² Pawel Swietojanski¹ Ondrej Miksik¹

¹ Emotech Labs   ² Carnegie Mellon University

Abstract

Natural human communication is nuanced and inherently multi-modal. Humans possess specialised sensoria for processing vocal, visual, linguistic, and para-linguistic information, but form an intricately fused percept of the multi-modal data stream to provide a holistic representation. Analysis of emotional content in face-to-face communication is a cognitive task to which humans are particularly attuned, given its sociological importance, and poses a difficult challenge for machine emulation due to the subtlety and expressive variability of cross-modal cues.

Inspired by the empirical success of recent so-called End-To-End Memory Networks (Sukhbaatar et al., 2015), we propose an approach based on recursive multi-attention with a shared external memory updated over multiple gated iterations of analysis. We evaluate our model across several large multi-modal datasets and show that globally contextualised memory with a gated memory update can effectively achieve emotion recognition.

1 Introduction

Multi-modal sequential data pose interesting challenges for learning machines that seek to derive representations. This constitutes an increasingly relevant sub-field of multi-view learning (Ngiam et al., 2011; Baltrusaitis et al., 2017). Examples of such modalities include visual, audio and textual data. Uni-modal observations are typically complementary to each other and hence they can reveal a fuller and more context-rich picture with better generalisation ability when used together. Through its complementary perspective, each view can unburden sub-modules specific to another modality of some of their modelling onus, which might otherwise learn implicit hidden causes that are over-fitted to training data idiosyncrasies in order to explain the training labels.

∗ Equal contribution.

On the other hand, multi-modal data introduce many difficulties for model design and training due to the distinct inherent dynamics of each modality. For instance, combining modalities with different temporal resolutions is an open problem. Other challenges include deciding where and how modalities are combined, leveraging the weak discriminative power of training labels in the presence of variability and noise, and dealing with complex situations such as modelling the emotion of sarcasm, where cues among modalities contradict.

In this paper, we address multi-modal sequence fusion for automatic emotion recognition. We believe that a strong model should enable:

(i) Specialisation of modality-specific sub-modules exploiting the inherent properties of their data streams, tapping into the mode-specific dynamics and characteristic patterns.

(ii) Weak (soft) data alignment, dividing heterogeneous sequences into segments with co-occurring events across modalities without alignment to a common time axis. This overcomes limitations of hard alignments, which often introduce spurious modelling assumptions and data inefficiencies (e.g. re-sampling) that must be performed again from scratch if views are added or removed.

(iii) Information exchange for both view-specific information and statistical strength for learning shared representations.

(iv) Scalability of the approach to many modalities using (a) parallelisable computation over modalities, and (b) a parameter set whose size grows (at most) linearly with the number of modalities.

In the present work, we detail a recursively attentive modelling approach. Our model fulfills the desiderata above and performs multiple sweeps of globally-contextualised analysis, so that one modality-specific representation cues the attention of the next and vice versa.

We evaluate our approach on three large-scale multi-modal datasets to verify its suitability.

2 Related work

2.1 Multi-modal analysis

Most approaches to multi-modal analysis (Ngiam et al., 2011) focus on designing feature representations, co-learning mechanisms to transfer information between modalities, and fusion techniques to perform a prediction or classification. These models typically perform either "early" fusion (input data are concatenated and pushed through a common model) or "late" fusion (outputs of the last layer are combined through linear or non-linear weighting). In contrast, our model does not fall directly into either of these categories, as it is "iterative" in the sense that there are multiple fusions per decision, with an evolving belief state – the memory. In addition, our model is also "active", since feature extraction from one modality can influence the nature of the feature extraction from another modality in the next time step via the shared memory.

For instance, Kim et al. (2013) used low-level hand-crafted features such as pitch, energy and mel-frequency filter banks (MFBs), capturing prosodic and spectral acoustic information, and Facial Animation Parameters (FAP) describing the movement of the face using distances between facial landmarks. In contrast, our model allows for end-to-end training of the feature representation.

Zhang et al. (2017) learnt motion cues in videos using 3D-CNNs over both spatial and temporal dimensions. They performed deep multi-modal fusion using a deep belief network that learnt non-linear relations across modalities and then used a linear SVM to classify emotions. Similarly, Vielzeuf et al. (2017) explored VGG-LSTM and 3DCNN-LSTM architectures and introduced a weighted score to prioritise the most relevant windows during learning. In our approach, the exchange of information between different modalities is not limited to the last layer of the model; owing to the memory component, each modality can influence every other in the following time steps.

Co-training and co-regularisation approaches to multi-view learning (Xu et al., 2013; Sindhwani and Niyogi, 2005) seek to leverage unlabelled data via a semi-supervised loss that encodes the consensus and complementarity principles. The former encodes the assertion that the predictions made by each view-specific learner should largely agree, and the latter encodes the assumption that each view contains useful information that is hidden from the others until an exchange of information is allowed to occur.

2.2 Memory Networks

End-To-End Memory Networks (Sukhbaatar et al., 2015) represent a fully differentiable alternative to the strong supervision-dependent Memory Networks (Weston, 2017). To bolster attention-based recurrent approaches to language modelling and question answering, they introduced a mechanism performing multiple hops of updates to a "memory" representation to provide context for the next sweep of attention computation.

Dynamic Memory Networks (DMN) (Xiong et al., 2016) integrate an attention mechanism with a memory module and multi-modal bilinear pooling to combine features across views and predict attention over images for the visual question answering task. Nam et al. (2017) iterated on this design to allow the memory update mechanism to reason over previous dual-attention outputs in the subsequent sweep, instead of forgetting this information. The present work extends the multi-attention framework to leverage neural-based information flow control by dynamically routing it with neural gating mechanisms.

The very recent work of Zadeh et al. (2018a) also approaches multi-view learning with recourse to a system of recurrent encoders and attention mediated by global memory fusion. However, fusion takes place at the encoder cell level, requires hard alignment, and is performed online in one sweep, so it cannot be informed by upstream context. The analysis window of the global memory is limited to the current and previous cell memories of each LSTM encoder, whereas our approach abstracts the shared memory update dynamics away from the ties of the encoding dynamics. Therefore our approach enables post-fusion and retrospective re-analysis of the entire cell memory history of all encoders at each analysis iteration.

3 Recursive Recurrent Neural Networks

Our approach is tailored to videos of single speakers, each divided into segments that roughly span one uttered sentence. We treat each segment as an independent datum constituting an individual multi-modal event with its own annotation, such that there is no temporal dependence across any two segments. In the following exposition, each of the various mechanisms we describe (encoding, attention, fusion, and memory update) acts on each segment in isolation of all others. We will use the terms "view" and "modality" interchangeably.

We refer to our recursively attentive analysis model as a Recursive Recurrent Neural Network (RRNN) since it resembles an RNN, but the hidden state and the next cell input are coupled in a recursion. At each step of the cell update there is no new incoming information; rather, the same original inputs are re-weighted by a new attention query to form the new cell inputs (see the discussion in Section 3.5 for more details).

3.1 Independent recurrent encoding

The major modelling assumption herein is that a single, independent recurrent encoding of each segment of each modality is sufficient to capture a range of semantic representations that can be tapped by several shared external memory queries. Each memory query is formed in a separate stage of an iterated analysis over the recurrent codes. Concretely, modality-specific attention-weighted summaries (a^(τ), v^(τ), t^(τ)) at analysis iteration τ contribute to the update of a shared dense memory/context vector m^(τ), which in turn serves as a differentiable attention query at iteration τ + 1 (cf. Fig. 1). This provides a recursive mechanism for sharing information within and across sequences, so the recurrent representations of one view can be revisited in light of cross-modal cues gleaned from previous sweeps over other views. This is an efficient alternative to re-encoding each view on every sweep, and is more modular and generalisable than routing information across views at the recurrent cell level.

For each multi-modal sequence segment x^n = {x^n_a, x^n_v, x^n_t}, a view-specific encoding is realised via a set of independent bi-directional LSTMs (Hochreiter and Schmidhuber, 1997), run over segments n ∈ [1, N]:

$$h_s^{\mathrm{fwd}}[n, k_s] = \mathrm{LSTM}\big(x_s^n[k_s],\, h_s^{\mathrm{fwd}}[n, k_s - 1]\big) \tag{1}$$

$$h_s^{\mathrm{bwd}}[n, k_s] = \mathrm{LSTM}\big(x_s^n[k_s],\, h_s^{\mathrm{bwd}}[n, k_s + 1]\big) \tag{2}$$

$$h_s[n, k_s] = \big[h_s^{\mathrm{fwd}}[n, k_s];\, h_s^{\mathrm{bwd}}[n, k_s]\big] \tag{3}$$

Here, s ∈ {a, v, t} denotes respectively the audio, visual and textual modalities, and k_s ∈ {1, ..., K_s} are view-specific state indices.

Figure 1: Schematic overview of the proposed neural architecture. The shared memory m^(τ) is updated with the contextualised embeddings a^(τ), v^(τ) and t^(τ).

The number of recurrent steps is view-specific (i.e. K_a ≠ K_v ≠ K_t) and is governed by the feature representation and sampling rate of the given view, e.g. the number of word embeddings in the text contained within a time-stamped segment. This is in contrast to Zadeh et al. (2018a), where the information in different views was grounded to a common time axis or number of steps at an early stage, either via up-sampling or down-sampling. Thus the extracted representations in our approach preserve the inherent time-scales of each modality and avoid the need for hard alignment, satisfying desiderata (i) and (ii) outlined in Section 1. Note that the input sequences x^(n)_s may refer to either raw or pre-processed data (see Section 4 for details). In the remainder, we drop the segment id n to reduce notational clutter.
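As a concrete illustration of this encoding stage, the sketch below (an illustrative PyTorch rendering, not the authors' released code; the class name `ViewEncoders` and the feature dimensions in the comments are assumptions) runs one independent bi-directional LSTM per view and keeps each view's own length K_s, with no resampling or alignment across views.

```python
import torch
import torch.nn as nn

class ViewEncoders(nn.Module):
    """One independent bi-directional LSTM per view (cf. Eqs. (1)-(3))."""

    def __init__(self, input_dims, hidden_dim):
        super().__init__()
        # input_dims: dict such as {"a": 74, "v": 35, "t": 300} (feature sizes are assumptions)
        self.encoders = nn.ModuleDict({
            s: nn.LSTM(d, hidden_dim, bidirectional=True, batch_first=True)
            for s, d in input_dims.items()
        })

    def forward(self, segments):
        # segments: dict of view -> tensor of shape (batch, K_s, input_dims[s]);
        # K_s may differ across views, so no hard alignment is performed.
        histories = {}
        for s, x in segments.items():
            h, _ = self.encoders[s](x)   # (batch, K_s, 2 * hidden_dim)
            histories[s] = h             # forward and backward states concatenated
        return histories
```

For a CMU-MOSEI-style setup one might instantiate this as `ViewEncoders({"a": 74, "v": 35, "t": 300}, hidden_dim=512)`, with those sizes again being illustrative rather than taken from the paper.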

3.2 Globally-contextualised attention

We used a view-specific attention-based weighting mechanism to compute a contextualised embedding c_s for a view s. The encoder outputs h_s are stacked along time to form matrices H_s ∈ R^(D×K_s). A shared dense memory m^(τ=0) is initialised by summing the time-averages of the H_s across the three modalities. M^(τ) is then constructed by repeating the shared memory m^(τ) K_s times such that it has the same size as the corresponding context H_s, i.e. H_s, M^(τ) ∈ R^(D×K_s).


An alignment function then scores how well H_s and M^(τ) are matched:

$$\alpha_s^{(\tau)} = \mathrm{align}\big(H_s, M^{(\tau)}\big). \tag{4}$$

The alignment mechanism entails a feedforward neural network with H_s and M^(τ) as inputs. A softmax is applied to the network output to derive the attention strength α. This architecture resembles that of Bahdanau et al. (2014); concretely,

$$R^{(\tau)} = \tanh\big(W^{(\tau)}_{s1} H_s\big) \odot \tanh\big(W^{(\tau)}_{s2} M^{(\tau)}\big), \tag{5}$$

$$\alpha_s^{(\tau)} = {w^{(\tau)}_{s3}}^{T} R^{(\tau)}, \tag{6}$$

$$\alpha_s^{(\tau)}[k_s] = \frac{\alpha_s^{(\tau)}[k_s]}{\sum_l \alpha_s^{(\tau)}[l]}. \tag{7}$$

In Eq. (5), W^(τ)_{s1} and W^(τ)_{s2} are square or fat matrices in the first layer of the alignment network, containing parameters governing the self-influence within view s and the influence from the shared memory M. For the majority of our experiments, we used the multiplicative method of Nam et al. (2017) to combine the two activation terms, but similar results were also obtained with the concatenative approach of Bahdanau et al. (2014). In Eq. (6), w^(τ)_{s3}^T is a vector projecting the un-normalised attention weights R^(τ) onto an alignment vector α of dimension K_s. Finally, Eq. (7) applies the softmax operation along the time steps k_s.

Parameters W_{s1}, W_{s2}, w_{s3} for deriving the attention strength α_s are in general distinct for each memory update step τ; however, they could also be tied across steps. In the standard attention scheme, the attention weight α_s is a vector spanning K_s. Note that w^(τ)_{s3} in Eq. (6) could be replaced by a matrix W^(τ)_{s3} to produce multi-head attention weights (Vaswani et al., 2017). Alternatively, the network inputs can be transposed so that attention scales each dimension D instead of each time step k_s. This can be seen as a variant of key-value attention (Daniluk et al., 2017), where the values differ from their keys by a linear transformation with weights governed by the alignment scores.

Each globally-contextualised view representation c_s is defined as the convex combination of the view-specific encoder outputs weighted by the attention strength:

$$c_s^{(\tau+1)} = \sum_{k_s} \alpha_s^{(\tau)}[k_s]\, h_s[k_s]. \tag{8}$$
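A minimal sketch of Eqs. (4)-(8) follows, assuming the multiplicative combination of Eq. (5); it is illustrative PyTorch rather than the authors' implementation, and the module name `GlobalContextAttention` is our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextAttention(nn.Module):
    """View-specific attention conditioned on the shared memory (cf. Eqs. (4)-(8))."""

    def __init__(self, enc_dim, mem_dim, align_dim):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, align_dim, bias=False)   # W_{s1}
        self.W2 = nn.Linear(mem_dim, align_dim, bias=False)   # W_{s2}
        self.w3 = nn.Linear(align_dim, 1, bias=False)         # w_{s3}

    def forward(self, H, m):
        # H: (batch, K_s, enc_dim) encoder history; m: (batch, mem_dim) shared memory.
        M = m.unsqueeze(1).expand(-1, H.size(1), -1)            # repeat memory K_s times
        R = torch.tanh(self.W1(H)) * torch.tanh(self.W2(M))     # multiplicative combination, Eq. (5)
        alpha = F.softmax(self.w3(R).squeeze(-1), dim=1)        # Eqs. (6)-(7): softmax over time
        c = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)         # Eq. (8): convex combination
        return c, alpha
```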

3.3 Shared memory update

The previous section described how the current shared memory state is used to modulate the attention-based re-analysis of the (encoded) inputs. Here we detail how the outcome of the re-analysis is used to update the shared memory state.

In contrast to the memory update employed in Nam et al. (2017), our approach includes a set of coupled gating mechanisms, outlined below and depicted schematically in Fig. 2:

$$g_w^{(\tau)} = \sigma\big(W_{wm}\, m^{(\tau-1)} + W_{ww}\, w^{(\tau)} + b_w\big) \tag{9}$$

$$g_c^{(\tau)} = \sigma\big(W_{cm}\, m^{(\tau-1)} + W_{cw}\, w^{(\tau)} + b_c\big) \tag{10}$$

$$g_s^{(\tau)} = \sigma\big(W_{sm}\, m^{(\tau-1)} + W_{ss}\, c_s^{(\tau)} + b_s\big) \quad \forall s \in \{a, v, t\} \tag{11}$$

$$u^{(\tau)} = \tanh\big(W_{um}\, m^{(\tau-1)} + W_{uw}\, g_w^{(\tau)} \odot w^{(\tau)} + b_u\big) \tag{12}$$

$$m^{(\tau)} = \big(1 - g_c^{(\tau)}\big) \odot m^{(\tau-1)} + g_c^{(\tau)} \odot u^{(\tau)}, \tag{13}$$

where w^(τ) = [a^(τ); v^(τ); t^(τ)], m^(0) = 0, and σ(·) denotes an element-wise sigmoid non-linearity. The function of the view context gate defined in Eq. (9) and invoked in Eq. (12) is to block corrupted or uninformative view segments from influencing the proposed shared memory update content u^(τ). The attention mechanism, outlined in Eqs. (5)-(7), cannot fulfill this task alone since the full attention divided over a view segment must sum to 1 even if no part of that segment is pertinent or salient. The utility of this gating will be empirically demonstrated in noise-injection experiments in Section 5.

The new memory content u^(τ) is written to the memory state according to Eq. (13), subject to the action of the memory update gate defined in Eq. (10). This update gate determines how much of the past global information should be passed on to contextualise subsequent stages of re-analysis. If the parameters W_{s1}, W_{s2}, w_{s3} are untied across the re-analysis steps, this update gate additionally accommodates short-cut or "highway" routing (Srivastava et al., 2015) of regression error gradients from the end of the multi-hop procedure back through the parameters of the earlier attention sweeps.
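The gated update of Eqs. (9)-(13) could look roughly as follows. This is an illustrative PyTorch sketch, not the authors' code: the per-view gates of Eq. (11) are omitted for brevity, and the class name `GatedMemoryUpdate` is an assumption.

```python
import torch
import torch.nn as nn

class GatedMemoryUpdate(nn.Module):
    """GRU-like shared memory update over analysis hops (cf. Eqs. (9), (10), (12), (13))."""

    def __init__(self, mem_dim, ctx_dim):
        super().__init__()
        # ctx_dim is the size of w = [a; v; t], the concatenated view contexts.
        self.gate_w = nn.Linear(mem_dim + ctx_dim, ctx_dim)    # view context gate, Eq. (9)
        self.gate_c = nn.Linear(mem_dim + ctx_dim, mem_dim)    # memory update gate, Eq. (10)
        self.cand_m = nn.Linear(mem_dim, mem_dim, bias=False)  # W_{um}
        self.cand_w = nn.Linear(ctx_dim, mem_dim)              # W_{uw}, b_u

    def forward(self, m_prev, w):
        # m_prev: (batch, mem_dim) memory from hop tau-1; w: (batch, ctx_dim) = [a; v; t].
        mw = torch.cat([m_prev, w], dim=-1)
        g_w = torch.sigmoid(self.gate_w(mw))                        # Eq. (9)
        g_c = torch.sigmoid(self.gate_c(mw))                        # Eq. (10)
        u = torch.tanh(self.cand_m(m_prev) + self.cand_w(g_w * w))  # Eq. (12)
        m_new = (1.0 - g_c) * m_prev + g_c * u                      # Eq. (13)
        return m_new
```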

Page 5: Multi-Modal Sequence Fusion via Recursive Attention for ... · movement of face using distances between facial landmarks. In contrast, our model allows for an end-to-end training

255

Figure 2: A detailed schematic of the proposed RRNN cell (left) and its legend (right). The routing above the dashed black line resembles that of a (non-recursive) GRU cell, where the concatenated attention output constitutes the cell's input. In this case, the cell's input at time τ is available only once the cell's state at time τ − 1 has been computed. When the static representations {h_a, h_v, h_t} are instead viewed as the cell's input, the cell forms a recursive RNN, which subsumes the attention mechanism as a cell sub-component.

Figure 3: Two consecutive cells of a Recursive Recurrent Neural Network. Note that the cells share a common input, in contrast with a typical RNN, which has a separate input to each cell.

3.4 Final Projection

After τ iterations of fusion and re-analysis, the resulting memory state m^(τ) is passed through a final fully-connected layer to yield the output corresponding to a particular task (regression predictions, or logits in the case of classification). In our experiments we found that increasing τ yields meaningful performance gains (up to τ = 3).

3.5 Recursive RNN: another perspective

The proposed gated memory update corresponds to maintaining an external recurrent cell memory that is recurrent over the consecutive analysis hops τ, rather than the actual time-steps k_s of the given modality. This allows the relevant memories of older hops to persist for use in the subsequent analysis hops.

The memory update equations (9)-(13) strongly resemble the GRU cell update (Cho et al., 2014); we treat the concatenated view context vectors as the GRU's inputs, one at each analysis hop τ. When viewed as a recurrent encoding of the inputs {h_s}, we refer to this architecture as a recursive recurrent neural net (RRNN), due to the recursive relationship between the cell's recurrent state and the attention-based re-weighting of the inputs. From this perspective, the attention mechanism forms a sub-component of the RRNN cell.

The key distinction from a typical GRU cell is that the reset or relevance gate g_w in a GRU typically gates the recurrent state (m^(τ) in our case), whereas we use it to gate the input, allowing uninformative view contexts to be excluded from the memory update. Gating the recurrent state is essential for avoiding vanishing gradients over long sequences, which is not such a concern for our recursion lengths of ≈ 3. One could of course reinstate the gating of the recurrent state, should recursions grow to more appreciable lengths.

A further distinction is that here the GRU "inputs" (the view contexts {a^(τ), v^(τ), t^(τ)} in our case) are computed online as the memory state recurs, unlike the standard case where they are data or pre-extracted features available before the RNN begins to operate. Figure 3 depicts two consecutive RRNN cells, illustrating the recycling of the same cell inputs. Figure 2 shows the details of a single cell, which subsumes the globally-contextualised attention mechanism detailed in Section 3.2.
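Tying the pieces together, a hedged sketch of the full multi-hop forward pass is given below. It reuses the hypothetical `ViewEncoders`, `GlobalContextAttention` and `GatedMemoryUpdate` modules sketched earlier, ties the attention parameters across hops (the paper allows them to be untied per hop), and initialises the memory with summed time-averages as in Sec. 3.2 (Sec. 3.3 alternatively sets m^(0) = 0).

```python
import torch
import torch.nn as nn

class RRNN(nn.Module):
    """Multi-hop fusion: encode once, then re-attend and update the memory for n_hops sweeps."""

    def __init__(self, input_dims, hidden_dim, n_outputs, n_hops=3):
        super().__init__()
        enc_dim = 2 * hidden_dim                      # bi-directional encoder output size
        ctx_dim = enc_dim * len(input_dims)           # size of w = [a; v; t]
        self.encoders = ViewEncoders(input_dims, hidden_dim)
        self.attention = nn.ModuleDict({
            s: GlobalContextAttention(enc_dim, enc_dim, enc_dim) for s in input_dims
        })
        self.memory_update = GatedMemoryUpdate(enc_dim, ctx_dim)
        self.output = nn.Linear(enc_dim, n_outputs)   # final projection (Sec. 3.4)
        self.n_hops = n_hops

    def forward(self, segments):
        H = self.encoders(segments)                                  # encode each view once
        m = sum(h.mean(dim=1) for h in H.values())                   # m^(0): summed time-averages
        for _ in range(self.n_hops):                                 # analysis hops tau = 1..n_hops
            contexts = [self.attention[s](H[s], m)[0] for s in sorted(H)]
            w = torch.cat(contexts, dim=-1)                          # w = [a; v; t]
            m = self.memory_update(m, w)                             # gated memory update
        return self.output(m)                                        # emotion intensities / logits
```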

4 Experimental setup

Datasets. We evaluated our approach on the CREMA-D (Cao et al., 2014), RAVDESS (Livingstone and Russo, 2012) and CMU-MOSEI (Zadeh et al., 2018b) datasets for multi-modal emotion analysis. The first two datasets provide audio and visual modalities, while CMU-MOSEI also adds text transcriptions. The CREMA-D dataset contains ∼7400 clips of 91 actors covering 6 emotions. RAVDESS is a speech and song database comprising ∼7300 files of 24 actors covering 8 emotional classes (including two canonical classes for "neutral" and "calm"). The CMU-MOSEI dataset consists of ∼3300 long clips segmented into ∼23000 short clips. In addition to audio and visual data, it also contains text transcriptions, allowing evaluation of tri-modal models.

These datasets are annotated with continuous-valued vectors corresponding to multi-class emotion labels. The ground-truth labels were generated by multiple human transcribers with score normalisation and agreement analysis; for further details, refer to the respective references.

Test conditions and baselines. Since each dataset uses a different emotion classification scheme, we trained and evaluated all models separately for each of them. Training was performed in an end-to-end manner with an L2 loss defined over the multi-class emotion labels.
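For orientation only, a minimal end-to-end training step under these conditions might look as follows. It assumes the hypothetical `RRNN` module sketched in Sec. 3.5, 6-dimensional emotion targets, and illustrative feature sizes (74 for COVAREP, 35 for FACET, 300 for GloVe); the optimiser and learning rate are likewise assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

# Illustrative setup: CMU-MOSEI-style tri-modal regression over 6 emotion intensities.
model = RRNN({"a": 74, "v": 35, "t": 300}, hidden_dim=512, n_outputs=6, n_hops=3)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def train_step(segments, targets):
    # segments: dict of view -> (batch, K_s, feature_dim); targets: (batch, 6).
    optimiser.zero_grad()
    loss = criterion(model(segments), targets)   # L2 loss over multi-class emotion labels
    loss.backward()
    optimiser.step()
    return loss.item()
```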

To establish a baseline, we evaluated a naive classifier predicting the test-set empirical mean intensities (with an MSE loss function) for each output regression dimension. Similar baselines were obtained for other loss functions by training a model with just one parameter per output dimension on that loss, where the model has access to the training labels but not the training inputs.

Evaluation. For CREMA-D and RAVDESS, we report accuracy scores, as these datasets contain labels for a multi-class classification task.

For CMU-MOSEI, we report the result of the 6-way emotion recognition. Recursive models as described in Sec. 3 predicted 6-dimensional emotion vectors, whose continuous values represent the intensities of the six emotion classes. Following Zadeh et al. (2018b), these predictions were evaluated against the reference emotions using mean square error (MSE) and mean absolute error (MAE), summed across the 6 classes. In addition, an acceptance threshold of 0.1 was set for each dimension/emotion, and weighted accuracy (Tong et al., 2017) was computed.
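A small NumPy sketch of this evaluation protocol is given below for illustration; the exact weighted-accuracy formula of Tong et al. (2017) is not reproduced in the paper, so the balanced true-positive/true-negative averaging used here is an assumption.

```python
import numpy as np

def evaluate_emotions(pred, ref, threshold=0.1):
    """pred, ref: arrays of shape (n_clips, 6) holding continuous emotion intensities."""
    mse = np.square(pred - ref).mean(axis=0).sum()   # MSE summed across the 6 classes
    mae = np.abs(pred - ref).mean(axis=0).sum()      # MAE summed across the 6 classes
    # Binarise each emotion at the acceptance threshold, then score a weighted accuracy
    # (average of true-positive and true-negative rates; an assumed reading of Tong et al., 2017).
    p, r = pred >= threshold, ref >= threshold
    tpr = (p & r).sum() / max(r.sum(), 1)
    tnr = (~p & ~r).sum() / max((~r).sum(), 1)
    return mse, mae, (tpr + tnr) / 2.0
```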

Complementary views across modalities. All experiments in this paper use independent recurrent encoding (Sec. 3.1). The encoding scheme differs for each modality: COVAREP (Degottex et al., 2014) was used for the audio modality, OpenFace (Amos et al., 2016) and FACET (iMotion, 2017) were used for the visual modality, and GloVe (Pennington et al., 2014) was used to encode the text features.

Independent recurrent encoding used bi-directional view-specific encoders with 2×128-dimensional outputs on CREMA-D and RAVDESS and 2×512-dimensional outputs on CMU-MOSEI. The complementary effect of multiple views from different modalities is illustrated by controlling the input views available to the different systems.

Attention. Globally contextualised attention (GCA) was implemented for the emotion recognition systems. The global and view-specific memories were projected to the alignment space (Eq. (5)), the attention weights were computed (Eqs. (6)-(7)), and the contextual view representation was derived (Eq. (8)); for more details, refer to Sec. 3.2. The encoder-decoder used a 128-dimensional (or 512-dimensional for CMU-MOSEI) fully-connected layer. A final linear layer mapped the decoder output to the multi-class targets.

GCA was compared to standard "early" and "late" fusion strategies.


Model                                                                    | Modality     | Accuracy
Human performance                                                        | Audio        | 40.9
COVAREP features + LSTM decoder                                          | Audio        | 41.5
OpenFace features + LSTM decoder                                         | Vision       | 52.5
Human performance                                                        | Vision       | 58.2
Human performance                                                        | Vision+Audio | 63.6
(OpenFace features + LSTM) + (COVAREP features + LSTM) + Dual Attention  | Vision+Audio | 65.0

Table 1: Results on the CREMA-D dataset across 6 emotions.

Modality       | Feature           | Encoder | Attention | Accuracy
Audio          | COVAREP           | LSTM    | Nil       | 41.25
Vision         | OpenFace          | LSTM    | Nil       | 52.08
Audio + Vision | COVAREP, OpenFace | LSTM    | GCA       | 58.33

Table 2: Results on the RAVDESS dataset across 8 emotions for the normal speech mode.

Figure 4: Visualisation of view-specific attention across time (x-axis: time in seconds; rows: text, audio, vision). Attention in the text modality focuses on the words "very" and "delicate" as cues for emotion recognition. Also, the difference in oscillation rates between the audio and visual modalities is noted.

In early fusion, encoder outputs across all views are resampled to the highest temporal resolution (i.e. that of audio, at 100 Hz), and the resulting (aligned) outputs are concatenated across views. We used an encoder-decoder structure similar to the one described in Sec. 3.2 (Fig. 1), except that the three parallel modality blocks were reduced to one. In late fusion, the final-step encoder outputs from all modalities were independently processed by 1-layer feed-forward networks (Sec. 3.4) and the view-specific multi-class targets were combined using linear weighting.
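For illustration, the early-fusion input construction might be sketched as follows; the choice of nearest-neighbour interpolation for up-sampling is an assumption, since the paper only states that all views are resampled to the highest temporal resolution.

```python
import torch
import torch.nn.functional as F

def early_fuse(views):
    """Resample every view to the longest sequence and concatenate feature-wise.

    views: dict of name -> tensor of shape (batch, K_s, D_s). Nearest-neighbour
    up-sampling is an illustrative choice, not taken from the paper.
    """
    k_max = max(v.size(1) for v in views.values())
    resampled = [
        F.interpolate(v.transpose(1, 2), size=k_max, mode="nearest").transpose(1, 2)
        for v in views.values()
    ]
    return torch.cat(resampled, dim=-1)   # (batch, K_max, sum of the D_s)
```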

Memory updates and ablation study. GCA was enhanced with the extra gating functions (cf. Eqs. (9)-(13), Sec. 3.3). The extended system was compared with the GCA system on the CMU-MOSEI data. To this end, we performed an ablation study using test data corrupted by additive Gaussian white noise added to the visual modality.

5 Results

Tables 1 and 2 show the results of emotion recognition on the CREMA-D and RAVDESS datasets, respectively. Audio, visual, and the joint use of bi-modal information were compared using identification accuracy. Models trained on the visual modality consistently outperformed models that use solely audio data. The highest accuracy was achieved when the audio and visual modalities were jointly modelled, giving 65% and 58.33% on the two datasets. Interestingly, the joint bi-modal system outperformed human performance on CREMA-D (Cao et al., 2014) by 1.4%.

On CMU-MOSEI, the errors between the reference and hypothesised six-dimensional emotion vectors were computed; the results are shown in Table 3.

The use of the visual modality resulted in the lowest mean square error (MSE), while the text modality gave the best performance when evaluated by mean absolute error (MAE) and weighted accuracy (WA). Basic techniques for combining information across modalities were not very effective, as indicated by the negligible gains of the early and late fusion models.

Globally contextualised attention (GCA) gave an MSE of 0.4696. Gating on the global and view-specific memory updates led to a further improvement, to 0.4691. The improvement in terms of MAE is even more significant (from 0.9412 to 0.8705).

Figure 4 visualises the attention weights of the different modalities on a CMU-MOSEI test sentence. The x-axis denotes time t and the y-axis is the magnitude of attention α_s(t) in the different views s ∈ {a, v, t}. The transcribed text is shown alongside the attention profile of the textual modality to align the attention weights with the recording.


Modality   | Feature                  | Encoder | Attention/Fusion | Corruption | MSE    | MAE    | WA
Text (T)   | Word-vec                 | LSTM    | Nil              | Nil        | 0.6326 | 0.9830 | 0.5485
Audio (A)  | COVAREP                  | LSTM    | Nil              | Nil        | 0.6049 | 1.0562 | 0.5249
Vision (V) | FACET                    | LSTM    | Nil              | Nil        | 0.5026 | 0.9909 | 0.5476
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | Early fusion     | Nil        | 0.5319 | 0.7694 | 0.5188
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | Late fusion      | Nil        | 0.5047 | 0.9825 | 0.5889
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | GCA              | Nil        | 0.4696 | 0.9412 | 0.6163
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | GCA              | Vision     | 0.5034 | 0.9920 | 0.6068
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | GCA + Gating     | Nil        | 0.4691 | 0.8705 | 0.5765
T+A+V      | COVAREP, FACET, Word-vec | LSTM    | GCA + Gating     | Vision     | 0.4742 | 0.8857 | 0.5688

Table 3: Results on the CMU-MOSEI dataset.

It can be seen that the GCA emotion recognition system was trained to attend dynamically to features of varying importance across time, unlike systems performing early or late fusion. The attention weights of the text modality show a clear jump for the words "very" and "delicate". The word "very", combined with an adjective, is often a strong cue for sentiment analysis, resulting in a spike in attention. The subject in this clip was speaking mostly in a neutral tone, with a nod and slight frowning towards the beginning of the sentence; this may correspond to the first peak in the attention trajectory of the visual data. The weights of the audio modality exhibited a higher oscillation rate than their counterpart on the visual data, as the COVAREP features had 4× higher temporal frequency than FACET.

Finally, we verified the contribution of the gating system to GCA using the corrupted visual data. When the GCA system is used without the gating mechanism, corrupted data result in increased MSE (from 0.4696 to 0.5034) and MAE (from 0.9412 to 0.9920). This is in contrast to the full system with gating (GCA + Gating in Table 3), which cancels the effects of the additive visual noise, as evidenced by the small gaps in MSE (0.4691 vs 0.4742) and MAE (0.8705 vs 0.8857) between clean and noisy data.

6 Conclusion

We have presented an approach for combining sequential, heterogeneous data. An external memory state is updated recursively, using globally-contextualised attention over a set of recurrent view-specific state histories. Our model was tested on the challenging task of emotion recognition from audio, visual, and textual data on three large-scale datasets. The complementary effect of jointly modelling emotions using multi-modal data was consistently shown across experiments on multiple datasets. Importantly, this approach eschews hard alignment of the data streams, allowing the view-specific encoders to respect the inherent dynamics of their input sequences. Encoder state histories are fused into cross-modal features via an attention mechanism that is modulated by a shared, external memory. The control of information flow in this fusion is further enhanced by a GRU-like gating mechanism, which can persist shared memory through multiple iterations while blocking corrupted or uninformative view-specific features. In future work, it would be interesting to investigate more structured fusion operations such as sparse tensor multilinear maps (Ben-younes et al., 2017).

References

Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal machine learning: A survey and taxonomy. CoRR, abs/1705.09406.

Hedi Ben-younes, Remi Cadene, Matthieu Cord, and Nicolas Thome. 2017. MUTAN: Multimodal Tucker fusion for visual question answering. CoRR, abs/1705.06676.

Houwei Cao, David G. Cooper, and Michael K. Keutmann. 2014. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Michal Daniluk, Tim Rocktaschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. CoRR, abs/1702.04521.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP – a collaborative voice analysis repository for speech technologies. In ICASSP.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation.

iMotion. 2017. Facial expression analysis.

Yelin Kim, Honglak Lee, and Emily Mower Provost. 2013. Deep learning for robust feature generation in audiovisual emotion recognition. In ICASSP.

S. R. Livingstone and F. A. Russo. 2012. The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE.

Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual attention networks for multimodal reasoning and matching. In CVPR.

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In ICML.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Vikas Sindhwani and Partha Niyogi. 2005. A co-regularized approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Training very deep networks. In NIPS.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS.

Edmund Tong, Amir Zadeh, Cara Jones, and Louis-Philippe Morency. 2017. Combating human trafficking with multimodal deep models. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, and Lukasz Kaiser. 2017. Attention is all you need. In NIPS.

Valentin Vielzeuf, Stephane Pateux, and Frederic Jurie. 2017. Temporal multimodal fusion for video emotion classification in the wild. CoRR, abs/1709.07200.

Jason Weston. 2017. Memory networks for recommendation. In RecSys.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In ICML.

Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. CoRR, abs/1304.5634.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. CoRR, abs/1802.00927.

Amir Zadeh, Paul Pu Liang, Jonathan Vanbriesen, Soujanya Poria, Edmund Tong, Erik Cambria, Minghai Chen, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL.

Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, and Qi Tian. 2017. Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology.