
Strong and Simple Baselines for Multimodal Utterance Embeddings

Paul Pu Liang*, Yao Chong Lim*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov and Louis-Philippe Morency

Human Language is often multimodal

Language: word choice, syntax, pragmatics
Visual: facial expressions, body language, eye contact, gestures
Acoustic: tone, prosody, phrasing

Sentiment: positive/negative, intensity
Emotion: anger, happiness, sadness, confusion, fear, surprise
Meaning: sarcasm, humor

Human Language is often multimodal

“This movie is great” + Neutral expression
“This movie is great” + Smile

Same words, but the sentiment intensity differs with the accompanying expression.

Challenges in Multimodal ML

1. Intramodal interactions
   Smile + Head nod vs. Smile + Head shake

2. Crossmodal interactions
   Bimodal: “This movie is great” + Smile
   Trimodal: “This movie is GREAT” (“great” is emphasized, drawn-out) + Smile → Sarcasm

Multimodal Language Embedding

“This is unbelievable!” (spoken loudly), carrying language, visual, and acoustic streams
    ↓
Utterance Embedding (captures intramodal + crossmodal interactions)
    ↓
Downstream Tasks
• Sentiment Analysis
• Emotion Recognition
• Speaker Trait Recognition, …
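As a concrete picture of this pipeline, here is a minimal sketch (all names hypothetical; the embedding function is a simple stand-in for the MMB1/MMB2 models described below): an utterance embedding is computed once and then reused by a lightweight downstream classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for MMB1/MMB2: summarize time-aligned word, visual,
# and acoustic feature sequences into one fixed-size utterance embedding.
def embed_utterance(words, visual, acoustic):
    return np.concatenate([words.mean(0), visual.mean(0), acoustic.mean(0)])

# Downstream task sketch: a simple sentiment classifier on the embeddings.
# X_utts is a list of (words, visual, acoustic) arrays; y holds sentiment labels.
def train_sentiment(X_utts, y):
    X = np.stack([embed_utterance(*u) for u in X_utts])
    return LogisticRegression(max_iter=1000).fit(X, y)
```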

Why fast models?

• Applications: robots, virtual agents, intelligent personal assistants
• Processing large amounts of multimedia data

Research Question

Can we make principled but simple models for multimodal utterance embeddings that perform competitively?

[Conceptual plot: performance vs. speed, marking current SOTA and our goal.]

Our models:
• Fewer parameters
• Closed-form solution
• Linear functions
• Competitive with SOTA!

A language-only solution

Arora et al. (2016, 2017): given word embeddings w_1, w_2, …, w_n for a sentence (“This manual is helpful”), each word is modeled as generated from the sentence embedding m_u:

$p(w_t \mid m_u) \propto \exp(w_t \cdot m_u)$

Fast: no learnable parameters.
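As a rough illustration, under this log-linear model the estimated sentence embedding is (approximately) a weighted average of the word vectors; with frequency-based weights this is the SIF embedding of Arora et al. (2017), which additionally removes the first principal component. A minimal sketch, assuming word vectors are given and the `word_freqs` argument is an optional, hypothetical input:

```python
import numpy as np

def sentence_embedding(word_vectors, smoothing=1e-3, word_freqs=None):
    """Approximate estimate of m_u under p(w_t | m_u) ∝ exp(w_t · m_u):
    a (frequency-weighted) average of the word vectors."""
    word_vectors = np.asarray(word_vectors)        # shape (n_words, dim)
    if word_freqs is None:
        weights = np.ones(len(word_vectors))       # fall back to a plain average
    else:
        weights = smoothing / (smoothing + np.asarray(word_freqs))
    return (weights[:, None] * word_vectors).sum(0) / weights.sum()
```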

MMB1: Representing intramodal interactions

Words w_1, w_2, …, w_n (“It doesn’t give help”) → utterance embedding m_u (Arora et al.)
Visual features v_1, v_2, …, v_n → utterance-level Gaussian parameters μ_v, σ_v
Audio features a_1, a_2, …, a_n → utterance-level Gaussian parameters μ_a, σ_a

The Gaussian parameters of each modality are tied to the utterance embedding m_u through linear transformations.

Small number of additional parameters!
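A minimal sketch of the per-utterance quantities MMB1 works with, assuming each modality is a time-aligned array of shape (n_steps, dim); the learned linear transformations tying the Gaussian parameters to m_u, and the full inference procedure, are omitted here:

```python
import numpy as np

def mmb1_summaries(word_vecs, visual_feats, audio_feats):
    """Illustrative unimodal summaries (not the paper's exact inference)."""
    m_lang = word_vecs.mean(0)                                 # Arora-style language summary
    mu_v, sigma_v = visual_feats.mean(0), visual_feats.std(0)  # visual Gaussian parameters
    mu_a, sigma_a = audio_feats.mean(0), audio_feats.std(0)    # acoustic Gaussian parameters
    return m_lang, (mu_v, sigma_v), (mu_a, sigma_a)
```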

Crossmodal interactions

“It didn’t help” + Neutral face + Stable voice → Disappointment
“It didn’t help” + Sad face + Shaky voice → Sadness

Same words, but the perceived emotion depends on the visual and acoustic modalities.

MMB2: Incorporating crossmodal interactions

In addition to the unimodal summaries, concatenate the time-aligned inputs of each subset of modalities and summarize each with Gaussian parameters, again tied to the utterance embedding m_u through linear transformations:

• W+V: [w_1, v_1] … [w_n, v_n] → μ_wv, σ_wv
• W+A: [w_1, a_1] … [w_n, a_n] → μ_wa, σ_wa
• V+A: [v_1, a_1] … [v_n, a_n] → μ_va, σ_va
• W+V+A: [w_1, v_1, a_1] … [w_n, v_n, a_n] → μ_wva, σ_wva
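A minimal sketch of how these crossmodal summaries can be formed, assuming `streams` maps a modality name to a time-aligned array of shape (n_steps, dim); as above, the learned linear maps tying the statistics to m_u are omitted:

```python
import numpy as np
from itertools import combinations

def mmb2_crossmodal_stats(streams):
    """Gaussian summaries of concatenated bimodal and trimodal inputs."""
    stats = {}
    names = list(streams)
    for r in (2, 3):                       # all pairs and the full triple
        for combo in combinations(names, r):
            concat = np.concatenate([streams[m] for m in combo], axis=1)
            stats[combo] = (concat.mean(0), concat.std(0))  # (mu, sigma)
    return stats

# usage: mmb2_crossmodal_stats({'w': W, 'v': V, 'a': A})
```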

How do we optimize the model?

Coordinate ascent-style, with two steps in each iteration:
1. Fix the transformation parameters and solve for the utterance embedding m_u.
2. Fix m_u and update the transformation parameters by gradient descent.
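A minimal self-contained sketch of this alternating scheme on a toy objective (an assumption for illustration, not the paper's exact model): visual features are modeled as a linear function of m_u, and we alternate a closed-form least-squares solve for m_u with a gradient step on the linear map W.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(20, 16))             # 20 visual frames, 16-dim features
W = rng.normal(scale=0.1, size=(16, 8))   # linear transformation (to be learned)
m_u = np.zeros(8)                         # utterance embedding

for _ in range(50):
    # Step 1: fix W, solve for m_u in closed form (least squares).
    m_u, *_ = np.linalg.lstsq(W, V.mean(axis=0), rcond=None)
    # Step 2: fix m_u, take a gradient step on W for the squared
    # reconstruction loss ||V.mean(0) - W @ m_u||^2.
    residual = V.mean(axis=0) - W @ m_u
    W -= 0.05 * (-2.0 * np.outer(residual, m_u))
```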

Datasets

CMU-MOSI (Zadeh et al., 2016)
• Multimodal Sentiment Analysis dataset
• 2,199 English opinion segments (monologues) from online videos
• Example: “I thought it was fun” with language, visual, and acoustic (elongation, emphasis) cues

Datasets

POM (Park et al., 2014)
• Multimodal Speaker Traits Recognition
• 903 English videos annotated for speaker traits such as confidence, dominance, vividness, relaxed, nervousness, humor, etc.

Compared Models

Deep neural models:
• Early Fusion: EF-LSTM
• DF (Nojavanasghari et al., 2016)
• Multi-view Learning: MV-LSTM (Rajagopalan et al., 2016)
• Contextual LSTM: BC-LSTM (Poria et al., 2017)
• Tensor Fusion: TFN (Zadeh et al., 2017)
• Memory Fusion: MFN (Zadeh et al., 2018)

Experiments

[Bar chart: CMU-MOSI sentiment, binary accuracy (%) on a 69–78 scale, comparing deep neural models (EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN) with our baselines (MMB1, MMB2); labeled values include 74.6, 75.1, and 77.4.]

Experiments

[Bar chart: POM speaker traits recognition, MAE on a 0.71–0.82 scale, deep neural models vs. our baselines; labeled values include 0.746, 0.774, and 0.785.]

Speed Comparisons

[Bar chart: average inference time (s) on a log scale from 0.001 to 10, deep neural models (EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN) vs. our baselines (MMB1, MMB2).]

Conclusion

• Proposed two simple but strong baselines for learning embeddings of multimodal utterances
• Try strong baselines before working on complicated models!

The End!

[Scatter plot: CMU-MOSI accuracy (%) (72–78) vs. inferences per second (100 to 1,000,000, log scale), deep neural models vs. our baselines.]

GitHub: yaochie/multimodal-baselines
Email: pliang@cs.cmu.edu, yaochonl@cs.cmu.edu

Additional Results

Experiments

[Bar chart: CMU-MOSI sentiment, correlation on a 0.4–0.65 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, F1 score on a 69–78 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, 7-class accuracy (%) on a 20–36 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, MAE on a 0.85–1.2 scale, deep neural models vs. our baselines.]
