Transcript of: Strong and Simple Baselines for Multimodal Utterance Embeddings
Strong and Simple Baselines for Multimodal Utterance Embeddings
Paul Pu Liang*, Yao Chong Lim*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov and Louis-Philippe Morency
Human Language is often multimodal
• Language: word choice, syntax, pragmatics
• Visual: facial expressions, body language, eye contact, gestures
• Acoustic: tone, prosody, phrasing
These signals convey:
• Sentiment: positive/negative, intensity
• Emotion: anger, happiness, sadness, confusion, fear, surprise
• Meaning: sarcasm, humor
Human Language is often multimodal
• "This movie is great" + neutral expression
• "This movie is great" + smile
• The same words convey different sentiment intensity depending on the accompanying facial expression.
Challenges in Multimodal ML
1. Intramodal interactions
• e.g. smile + head nod vs. smile + head shake
2. Crossmodal interactions
• Bimodal: "This movie is great" + smile
• Trimodal: "This movie is GREAT" ("great" is emphasized and drawn out) + smile → sarcasm
Multimodal Language Embedding
• "This is unbelievable!" (spoken loudly)
• Language, visual, and acoustic inputs are fused into a single utterance embedding.
• The embedding captures intramodal + crossmodal interactions.
• Downstream tasks: sentiment analysis, emotion recognition, speaker trait recognition, …
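This pipeline is essentially an encoder plus a small downstream predictor. Below is a minimal sketch of how an utterance embedding could feed a downstream sentiment classifier; `embed_utterance` stands in for one of the MMB models sketched later, and the logistic-regression probe is an illustrative choice, not the paper's exact evaluation protocol.

```python
# Hypothetical usage sketch: utterance embeddings -> downstream sentiment task.
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_downstream(train_utts, train_labels, test_utts, test_labels, embed_utterance):
    """Embed each utterance, fit a linear probe, and report test accuracy."""
    X_train = np.stack([embed_utterance(u) for u in train_utts])
    X_test = np.stack([embed_utterance(u) for u in test_utts])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)  # e.g. binary sentiment accuracy
```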
Why fast models?
• Applications: robots, virtual agents, intelligent personal assistants
• Processing large amounts of multimedia data
Research Question
Can we make principled but simple models for multimodal utterance embeddings that perform competitively?
[Figure: performance vs. speed; current SOTA sits at high performance but low speed, while our goal is comparable performance at much higher speed.]
Our models:
• Fewer parameters
• Have a closed-form solution
• Use linear functions
• Are competitive with SOTA!
A language-only solution
Arora et al. (2016, 2017):
• Word embeddings $w_1, w_2, \ldots, w_n$ (e.g. "This manual is helpful") are combined into a sentence embedding $m_s$.
• Generative model: $p(w_i \mid m_s) \propto \exp(w_i \cdot m_s)$
• Fast: no learnable parameters.
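For concreteness, here is a minimal sketch of this language-only baseline, assuming pre-trained word vectors and unigram frequencies are available; the smooth-inverse-frequency weighting follows Arora et al. (2017), and the variable names are illustrative rather than taken from the released code.

```python
import numpy as np

def sif_sentence_embedding(words, word_vecs, word_freq, a=1e-3):
    """Weighted average of word embeddings. Under the log-linear model
    p(w_i | m_s) ∝ exp(w_i · m_s), this approximates the MAP estimate of m_s.
    Nothing is learned, so it is extremely fast."""
    vecs = [(a / (a + word_freq.get(w, 0.0))) * word_vecs[w]  # rarer words get larger weight
            for w in words if w in word_vecs]
    return np.mean(vecs, axis=0)
```

Arora et al. (2017) additionally remove a common principal component across sentences, omitted here for brevity.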
MMB1: Representing intramodal interactions
• Words $w_1, w_2, \ldots, w_n$ (e.g. "It doesn't give help") contribute to the utterance embedding $m_s$ as in Arora et al.
• Visual features $v_1, v_2, \ldots, v_n$ and acoustic features $a_1, a_2, \ldots, a_n$ are summarized by utterance-level feature distributions: Gaussian parameters $(\mu_v, \sigma_v)$ and $(\mu_a, \sigma_a)$.
• Linear transformations connect these Gaussian parameters to the utterance embedding $m_s$.
• Small number of additional parameters!
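A minimal sketch of how MMB1 could be assembled, under my reading of this slide rather than the released implementation; the linear maps `W_v, b_v, W_a, b_a` are the only learnable pieces, and their exact shapes here are assumptions.

```python
import numpy as np

def gaussian_stats(x):
    """Utterance-level Gaussian parameters [mu; sigma] of an (n_steps, dim) sequence."""
    return np.concatenate([x.mean(axis=0), x.std(axis=0)])

def mmb1_embedding(word_vecs, visual, acoustic, W_v, b_v, W_a, b_a):
    """word_vecs: (n, d_w); visual: (n, d_v); acoustic: (n, d_a) time-aligned sequences."""
    m_lang = word_vecs.mean(axis=0)        # language term (Arora et al.-style average)
    g_v = gaussian_stats(visual)           # intramodal visual statistics (mu_v, sigma_v)
    g_a = gaussian_stats(acoustic)         # intramodal acoustic statistics (mu_a, sigma_a)
    # Small linear transformations map the nonverbal statistics into the embedding space.
    return m_lang + (W_v @ g_v + b_v) + (W_a @ g_a + b_a)
```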
Crossmodal interactions
• "It didn't help" + neutral face + stable voice → disappointment
• "It didn't help" + sad face + shaky voice → sadness
• The same words combined with different visual and acoustic behavior convey different emotions.
MMB2: Incorporating crossmodal interactions
• In addition to the unimodal streams, concatenate the time-aligned inputs from each pair and from all three modalities:
• W+V: $[w_1, v_1], \ldots, [w_n, v_n]$ with Gaussian parameters $(\mu_{wv}, \sigma_{wv})$
• W+A: $[w_1, a_1], \ldots, [w_n, a_n]$ with Gaussian parameters $(\mu_{wa}, \sigma_{wa})$
• V+A: $[v_1, a_1], \ldots, [v_n, a_n]$ with Gaussian parameters $(\mu_{va}, \sigma_{va})$
• W+V+A: $[w_1, v_1, a_1], \ldots, [w_n, v_n, a_n]$ with Gaussian parameters $(\mu_{wva}, \sigma_{wva})$
• Linear transformations connect these Gaussian parameters to the utterance embedding $m_s$.
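Extending the same sketch to MMB2 (again an assumption-laden reading of the slide, not the released code): concatenate the time-aligned features from each pair and from the triple of modalities, summarize each concatenated stream with its own Gaussian parameters, and add linearly transformed versions of those statistics to the embedding.

```python
import numpy as np

def gaussian_stats(x):
    """Utterance-level Gaussian parameters [mu; sigma] of an (n_steps, dim) sequence."""
    return np.concatenate([x.mean(axis=0), x.std(axis=0)])

def mmb2_embedding(w, v, a, transforms):
    """w, v, a: time-aligned (n, d_*) sequences; transforms: dict of (W, b) per stream."""
    m = w.mean(axis=0)                                   # language term
    streams = {
        "v": v, "a": a,                                  # unimodal, as in MMB1
        "wv": np.concatenate([w, v], axis=1),            # word + visual
        "wa": np.concatenate([w, a], axis=1),            # word + acoustic
        "va": np.concatenate([v, a], axis=1),            # visual + acoustic
        "wva": np.concatenate([w, v, a], axis=1),        # trimodal
    }
    for name, seq in streams.items():
        W, b = transforms[name]
        m = m + (W @ gaussian_stats(seq) + b)            # linear maps only, few parameters
    return m
```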
How do we optimize the model?
Coordinate ascent-style optimization over the utterance embedding $m_s$ and the linear transformation parameters.
Two steps each iteration:
1. Fix the transformation parameters, solve for $m_s$.
2. Fix $m_s$, update the transformation parameters by gradient descent.
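A hedged sketch of this alternating loop; only the two-step structure comes from the slide, while the objective, the embedding solve, and the gradient computation are left as placeholder callables that would depend on the model details.

```python
def coordinate_ascent(utterances, params, solve_embedding, param_grads,
                      n_iters=10, lr=1e-2):
    """Alternate between solving for utterance embeddings and updating the
    linear transformation parameters. `solve_embedding` and `param_grads`
    are placeholders for the model-specific steps."""
    embeddings = []
    for _ in range(n_iters):
        # Step 1: transformation parameters fixed -> solve for each m_s.
        embeddings = [solve_embedding(u, params) for u in utterances]
        # Step 2: embeddings fixed -> gradient step on the transformation parameters.
        grads = param_grads(utterances, embeddings, params)
        for name in params:
            params[name] = params[name] - lr * grads[name]  # gradient descent, per the slide
    return embeddings, params
```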
Datasets
CMU-MOSI (Zadeh et al., 2016)
• Multimodal sentiment analysis dataset
• 2,199 English opinion segments (monologues) from online videos
• Example: "I thought it was fun" with language (words), visual (facial expression), and acoustic (elongation, emphasis) cues
Datasets
POM (Park et al., 2014)
• Multimodal speaker traits recognition
• 903 English videos annotated for speaker traits such as confidence, dominance, vividness, relaxed, nervousness, humor, etc.
Compared Models
Deep neural models:
• Early fusion: EF-LSTM
• Deep fusion: DF (Nojavanasghari et al., 2016)
• Multi-view learning: MV-LSTM (Rajagopalan et al., 2016)
• Contextual LSTM: BC-LSTM (Poria et al., 2017)
• Tensor fusion: TFN (Zadeh et al., 2017)
• Memory fusion: MFN (Zadeh et al., 2018)
Experiments
[Bar chart: binary accuracy (%) on CMU-MOSI sentiment for the deep neural models (EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN) and our baselines (MMB1, MMB2); highlighted values include 74.6, 75.1, and 77.4.]
Experiments
[Bar chart: MAE on POM speaker traits recognition for the deep neural models and our baselines; highlighted values include 0.774, 0.746, and 0.785.]
Speed Comparisons
[Bar chart, log scale from 0.001 s to 10 s: average inference time for the deep neural models and our baselines (MMB1, MMB2).]
Conclusion
• Proposed two simple but strong baselines for learning embeddings of multimodal utterances
• Try strong baselines before working on complicated models!
The End!
[Scatter plot: CMU-MOSI accuracy (%) vs. inferences per second (log scale, 100 to 1,000,000) for the deep neural models and our baselines.]
GitHub: yaochie/multimodal-baselines
Email: [email protected]@cs.cmu.edu
Additional Results
Experiments
[Bar chart: correlation on CMU-MOSI sentiment for the deep neural models and our baselines.]
Experiments
[Bar chart: F1 score on CMU-MOSI sentiment for the deep neural models and our baselines.]
Experiments
[Bar chart: 7-class accuracy (%) on CMU-MOSI sentiment for the deep neural models and our baselines.]
Experiments
[Bar chart: MAE on CMU-MOSI sentiment for the deep neural models and our baselines.]