
Strong and Simple Baselines for Multimodal Utterance Embeddings

Paul Pu Liang*, Yao Chong Lim*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov and Louis-Philippe Morency

Human Language is often multimodal

Language: word choice, syntax, pragmatics
Visual: facial expressions, body language, eye contact, gestures
Acoustic: tone, prosody, phrasing

Sentiment: positive/negative, intensity
Emotion: anger, happiness, sadness, confusion, fear, surprise
Meaning: sarcasm, humor

Human Language is often multimodal

“This movie is great” + Neutral expression
“This movie is great” + Smile

Same words, but the sentiment intensity differs with the accompanying expression.

Challenges in Multimodal ML

1. Intramodal interactions
   Smile + Head nod vs. Smile + Head shake

2. Crossmodal interactions
   Bimodal: “This movie is great” + Smile
   Trimodal: “This movie is GREAT” (“great” is emphasized, drawn-out) + Smile → Sarcasm

Multimodal Language Embedding

“This is unbelievable!” (spoken loudly), carrying language, visual, and acoustic streams
    ↓
Utterance Embedding (captures intramodal + crossmodal interactions)
    ↓
Downstream Tasks
• Sentiment Analysis
• Emotion Recognition
• Speaker Trait Recognition, …
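As a concrete picture of this pipeline, here is a minimal sketch (all names hypothetical; the embedding function is a simple stand-in for the MMB1/MMB2 models described below): an utterance embedding is computed once and then reused by a lightweight downstream classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for MMB1/MMB2: summarize time-aligned word, visual,
# and acoustic feature sequences into one fixed-size utterance embedding.
def embed_utterance(words, visual, acoustic):
    return np.concatenate([words.mean(0), visual.mean(0), acoustic.mean(0)])

# Downstream task sketch: a simple sentiment classifier on the embeddings.
# X_utts is a list of (words, visual, acoustic) arrays; y holds sentiment labels.
def train_sentiment(X_utts, y):
    X = np.stack([embed_utterance(*u) for u in X_utts])
    return LogisticRegression(max_iter=1000).fit(X, y)
```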

Why fast models?

• Applications: robots, virtual agents, intelligent personal assistants
• Processing large amounts of multimedia data

Research Question

Can we make principled but simple models for multimodal utterance embeddings that perform competitively?

[Conceptual plot: performance vs. speed, marking current SOTA and our goal.]

Our models:
• Fewer parameters
• Closed-form solution
• Linear functions
• Competitive with SOTA!

A language-only solution

Arora et al. (2016, 2017): given word embeddings w_1, w_2, …, w_n for a sentence (“This manual is helpful”), each word is modeled as generated from the sentence embedding m_u:

$p(w_t \mid m_u) \propto \exp(w_t \cdot m_u)$

Fast: no learnable parameters.
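As a rough illustration, under this log-linear model the estimated sentence embedding is (approximately) a weighted average of the word vectors; with frequency-based weights this is the SIF embedding of Arora et al. (2017), which additionally removes the first principal component. A minimal sketch, assuming word vectors are given and the `word_freqs` argument is an optional, hypothetical input:

```python
import numpy as np

def sentence_embedding(word_vectors, smoothing=1e-3, word_freqs=None):
    """Approximate estimate of m_u under p(w_t | m_u) ∝ exp(w_t · m_u):
    a (frequency-weighted) average of the word vectors."""
    word_vectors = np.asarray(word_vectors)        # shape (n_words, dim)
    if word_freqs is None:
        weights = np.ones(len(word_vectors))       # fall back to a plain average
    else:
        weights = smoothing / (smoothing + np.asarray(word_freqs))
    return (weights[:, None] * word_vectors).sum(0) / weights.sum()
```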

MMB1: Representing intramodal interactions

Words w_1, w_2, …, w_n (“It doesn’t give help”) → utterance embedding m_u (Arora et al.)
Visual features v_1, v_2, …, v_n → utterance-level Gaussian parameters μ_v, σ_v
Audio features a_1, a_2, …, a_n → utterance-level Gaussian parameters μ_a, σ_a

The Gaussian parameters of each modality are tied to the utterance embedding m_u through linear transformations.

Small number of additional parameters!
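A minimal sketch of the per-utterance quantities MMB1 works with, assuming each modality is a time-aligned array of shape (n_steps, dim); the learned linear transformations tying the Gaussian parameters to m_u, and the full inference procedure, are omitted here:

```python
import numpy as np

def mmb1_summaries(word_vecs, visual_feats, audio_feats):
    """Illustrative unimodal summaries (not the paper's exact inference)."""
    m_lang = word_vecs.mean(0)                                 # Arora-style language summary
    mu_v, sigma_v = visual_feats.mean(0), visual_feats.std(0)  # visual Gaussian parameters
    mu_a, sigma_a = audio_feats.mean(0), audio_feats.std(0)    # acoustic Gaussian parameters
    return m_lang, (mu_v, sigma_v), (mu_a, sigma_a)
```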

Crossmodal interactions

“It didn’t help” + Neutral face + Stable voice → Disappointment
“It didn’t help” + Sad face + Shaky voice → Sadness

Same words, but the perceived emotion depends on the visual and acoustic modalities.

MMB2: Incorporating crossmodal interactions

In addition to the unimodal summaries, concatenate the time-aligned inputs of each subset of modalities and summarize each with Gaussian parameters, again tied to the utterance embedding m_u through linear transformations:

• W+V: [w_1, v_1] … [w_n, v_n] → μ_wv, σ_wv
• W+A: [w_1, a_1] … [w_n, a_n] → μ_wa, σ_wa
• V+A: [v_1, a_1] … [v_n, a_n] → μ_va, σ_va
• W+V+A: [w_1, v_1, a_1] … [w_n, v_n, a_n] → μ_wva, σ_wva
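A minimal sketch of how these crossmodal summaries can be formed, assuming `streams` maps a modality name to a time-aligned array of shape (n_steps, dim); as above, the learned linear maps tying the statistics to m_u are omitted:

```python
import numpy as np
from itertools import combinations

def mmb2_crossmodal_stats(streams):
    """Gaussian summaries of concatenated bimodal and trimodal inputs."""
    stats = {}
    names = list(streams)
    for r in (2, 3):                       # all pairs and the full triple
        for combo in combinations(names, r):
            concat = np.concatenate([streams[m] for m in combo], axis=1)
            stats[combo] = (concat.mean(0), concat.std(0))  # (mu, sigma)
    return stats

# usage: mmb2_crossmodal_stats({'w': W, 'v': V, 'a': A})
```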

How do we optimize the model?

Coordinate ascent-style, with two steps in each iteration:
1. Fix the transformation parameters and solve for the utterance embedding m_u.
2. Fix m_u and update the transformation parameters by gradient descent.
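A minimal self-contained sketch of this alternating scheme on a toy objective (an assumption for illustration, not the paper's exact model): visual features are modeled as a linear function of m_u, and we alternate a closed-form least-squares solve for m_u with a gradient step on the linear map W.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(20, 16))             # 20 visual frames, 16-dim features
W = rng.normal(scale=0.1, size=(16, 8))   # linear transformation (to be learned)
m_u = np.zeros(8)                         # utterance embedding

for _ in range(50):
    # Step 1: fix W, solve for m_u in closed form (least squares).
    m_u, *_ = np.linalg.lstsq(W, V.mean(axis=0), rcond=None)
    # Step 2: fix m_u, take a gradient step on W for the squared
    # reconstruction loss ||V.mean(0) - W @ m_u||^2.
    residual = V.mean(axis=0) - W @ m_u
    W -= 0.05 * (-2.0 * np.outer(residual, m_u))
```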

Datasets

CMU-MOSI (Zadeh et al., 2016)
• Multimodal Sentiment Analysis dataset
• 2,199 English opinion segments (monologues) from online videos
• Example: “I thought it was fun” with language, visual, and acoustic (elongation, emphasis) cues

Datasets

POM (Park et al., 2014)
• Multimodal Speaker Traits Recognition
• 903 English videos annotated for speaker traits such as confidence, dominance, vividness, relaxed, nervousness, humor, etc.

Compared Models

Deep neural models:
• Early Fusion: EF-LSTM
• DF (Nojavanasghari et al., 2016)
• Multi-view Learning: MV-LSTM (Rajagopalan et al., 2016)
• Contextual LSTM: BC-LSTM (Poria et al., 2017)
• Tensor Fusion: TFN (Zadeh et al., 2017)
• Memory Fusion: MFN (Zadeh et al., 2018)

Experiments

[Bar chart: CMU-MOSI sentiment, binary accuracy (%) on a 69–78 scale, comparing deep neural models (EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN) with our baselines (MMB1, MMB2); labeled values include 74.6, 75.1, and 77.4.]

Experiments

[Bar chart: POM speaker traits recognition, MAE on a 0.71–0.82 scale, deep neural models vs. our baselines; labeled values include 0.746, 0.774, and 0.785.]

Speed Comparisons

[Bar chart: average inference time (s) on a log scale from 0.001 to 10, deep neural models (EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN) vs. our baselines (MMB1, MMB2).]

Conclusion

• Proposed two simple but strong baselines for learning embeddings of multimodal utterances
• Try strong baselines before working on complicated models!

The End!

[Scatter plot: CMU-MOSI accuracy (%) (72–78) vs. inferences per second (100 to 1,000,000, log scale), deep neural models vs. our baselines.]

GitHub: yaochie/multimodal-baselines
Email: pliang@cs.cmu.edu, yaochonl@cs.cmu.edu

Additional Results

Experiments

[Bar chart: CMU-MOSI sentiment, correlation on a 0.4–0.65 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, F1 score on a 69–78 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, 7-class accuracy (%) on a 20–36 scale, deep neural models vs. our baselines.]

Experiments

[Bar chart: CMU-MOSI sentiment, MAE on a 0.85–1.2 scale, deep neural models vs. our baselines.]
