Advanced Multimodal Machine Learning
Transcript of Advanced Multimodal Machine Learning
Louis-Philippe Morency
Advanced Multimodal Machine Learning
Lecture 4.1: Recurrent Networks
* Original version co-developed with Tadas Baltrusaitis
Administrative Stuff
Upcoming Schedule
▪ First project assignment:
▪ Proposal presentation (10/2 and 10/4)
▪ First project report (Sunday 10/7)
▪ Second project assignment
▪ Midterm presentations (11/6 and 11/8)
▪ Midterm report (Sunday 11/11)
▪ Final project assignment
▪ Final presentation (12/4 & 12/6)
▪ Final report (Sunday 12/9)
Proposal Presentation (10/2 and 10/4)
▪ 5 minutes (about 5-10 slides)
▪ All team members should be involved in the presentation
▪ Will receive feedback from instructors and other students
▪ 1-2 minutes between presentations reserved for written feedback
▪ Main presentation points
▪ General research problem and motivation
▪ Dataset and input modalities
▪ Multimodal challenges and prior work
▪ You need to submit a copy of your slides (PDF or PPT)
▪ Deadline: Friday 10/5 (on Gradescope)
Project Proposal Report
▪ Part 1 (updated version of your pre-proposal)
▪ Research problem:
▪ Describe and motivate the research problem
▪ Define in generic terms the main computational challenges
▪ Dataset and Input Modalities:
▪ Describe the dataset(s) you are planning to use for this project
▪ Describe the input modalities and annotations available in this dataset
Project Proposal Report
▪ Part 2
▪ Related Work:
▪ Include 12-15 paper citations that give an overview of the prior work
▪ Present in more detail the 3-4 research papers most related to your work
▪ Research Challenges and Hypotheses:
▪ Describe your specific challenges and/or research hypotheses
▪ Highlight the novel aspects of your proposed research
Project Proposal Report
▪ Part 3
▪ Language Modality Exploration:
▪ Explore neural language models on your dataset (e.g., using Keras)
▪ Train at least two different language models (e.g., using SimpleRNN, GRU or LSTM) on your dataset and compare their perplexity (a minimal Keras sketch follows this list)
▪ Include qualitative examples of successes and failure cases
▪ Visual Modality Exploration:
▪ Explore pre-trained Convolutional Neural Networks (CNNs) on your dataset
▪ Load a pre-existing CNN model trained for object recognition (e.g., VGG-Net) and process your test images
▪ Extract features at different network layers and visualize them (using t-SNE) with overlaid class labels in different colors
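As a starting point for the language modality exploration, here is a minimal Keras sketch of a word-level LSTM language model. The vocabulary size, sequence length, and the random placeholder arrays are assumptions to be replaced by your own tokenized corpus; perplexity is recovered as the exponential of the average cross-entropy loss.

```python
import numpy as np
from tensorflow import keras

# Placeholder "dataset": replace with token ids from your own corpus,
# where y is x shifted by one position (next-word targets).
vocab_size, seq_len = 5000, 20
x = np.random.randint(0, vocab_size, size=(1000, seq_len))
y = np.random.randint(0, vocab_size, size=(1000, seq_len))

model = keras.Sequential([
    keras.Input(shape=(seq_len,), dtype="int32"),
    keras.layers.Embedding(vocab_size, 128),
    keras.layers.LSTM(256, return_sequences=True),  # or SimpleRNN / GRU
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
history = model.fit(x, y, epochs=3, batch_size=32, validation_split=0.1)

# Perplexity = exp(average per-word cross-entropy).
print("validation perplexity:", np.exp(history.history["val_loss"][-1]))
```

Swapping keras.layers.LSTM for keras.layers.SimpleRNN or keras.layers.GRU gives the two-model comparison the assignment asks for.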
Lecture Objectives
▪ Word representations & distributional hypothesis
▪ Learning neural representations (e.g., Word2vec)
▪ Language models and sequence modeling tasks
▪ Recurrent neural networks
▪ Backpropagation through time
▪ Gated recurrent neural networks
▪ Long Short-Term Memory (LSTM) model
Representing Words:
Distributed Semantics
Possible ways of representing words

Given a text corpus containing 100,000 unique words:

Classic binary word representation: [0; 0; 0; 0; …; 0; 0; 1; 0; …; 0; 0]
→ a 100,000d vector, non-zero only at the index of the word

Classic word feature representation: [5; 1; 0; 0; …; 0; 20; 1; 0; …; 3; 0]
→ a 300d vector of 300 manually defined "good" features (e.g., ends in –ing)

Learned word representation: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
→ a 300d vector that should approximate the "meaning" of the word
The Distributional Hypothesis

▪ Distributional Hypothesis (DH) [Lenci 2008]
▪ At least certain aspects of the meaning of lexical expressions depend on their distributional properties in the linguistic contexts
▪ The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear
▪ Weak and strong DH
▪ Weak view: a quantitative method for semantic analysis and lexical resource induction
▪ Strong view: a cognitive hypothesis about the form and origin of semantic representations, assuming that word distributions in context play a specific causal role in forming meaning representations
What is the meaning of "bardiwac"?

▪ He handed her her glass of bardiwac.
▪ Beef dishes are made to complement the bardiwacs.
▪ Nigel staggered to his feet, face flushed from too much bardiwac.
▪ Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.
▪ I dined off bread and cheese and this excellent bardiwac.
▪ The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

→ bardiwac is a heavy red alcoholic beverage made from grapes
Geometric interpretation

▪ The row vector $x_{dog}$ describes the usage of the word dog in the corpus
▪ It can be seen as the coordinates of a point in n-dimensional Euclidean space $R^n$

(Stefan Evert 2010)
Distance and similarity

▪ Illustrated for two dimensions, get and use: $x_{dog} = (115, 10)$
▪ Similarity = spatial proximity (Euclidean distance)
▪ Location depends on the frequency of the noun ($f_{dog} \approx 2.7 \cdot f_{cat}$)

(Stefan Evert 2010)
Angle and similarity

▪ Direction is more important than location
▪ Normalise the "length" $\lVert x_{dog} \rVert$ of the vector
▪ Or use the angle as a distance measure

(Stefan Evert 2010)
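To make the distance-versus-angle point concrete, here is a small numpy sketch; $x_{dog} = (115, 10)$ is from the slide, while the cat vector is a made-up illustration of a word with similar usage but roughly 2.7x lower frequency.

```python
import numpy as np

# Co-occurrence vectors over two context words (get, use).
# x_dog is from the slide; x_cat is a hypothetical lower-frequency word
# with a similar usage profile.
x_dog = np.array([115.0, 10.0])
x_cat = np.array([42.0, 4.0])

# Euclidean distance is dominated by frequency (vector length)...
print(f"Euclidean distance: {np.linalg.norm(x_dog - x_cat):.1f}")

# ...while cosine similarity compares directions only.
cos = x_dog @ x_cat / (np.linalg.norm(x_dog) * np.linalg.norm(x_cat))
print(f"Cosine similarity:  {cos:.4f}")   # close to 1: similar contexts
```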
Semantic maps
Learning Neural
Word Representations
How to learn neural word representations?

Distributional hypothesis: approximate the word meaning by its surrounding words. Words used in a similar context will lie close together:

He was walking away because …
He was running away because …

Instead of capturing co-occurrence counts directly, predict the surrounding words of every word.
How to learn neural word representations?

He was walking away because …
He was running away because …

[Diagram: a two-layer network x → W1 → W2 → y. The input x is the 1-of-N (one-hot) encoding of the center word "walking" (100,000d). W1 maps it to a 300d hidden layer, with no activation function → very fast. W2 maps the 300d hidden layer to 100,000d output predictions y for the surrounding words "He", "was", "away", "because".]

Word2vec algorithm: https://code.google.com/p/word2vec/
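The following numpy sketch shows the same idea end to end on a toy vocabulary: a two-matrix skip-gram-style model trained to predict surrounding words with a full softmax. Real word2vec scales to 100,000-word vocabularies with negative sampling or hierarchical softmax; the vocabulary, dimensions, and learning rate here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["he", "was", "walking", "running", "away", "because"]
V, D = len(vocab), 8                      # vocabulary size, embedding dim
idx = {w: i for i, w in enumerate(vocab)}

W1 = rng.normal(0, 0.1, (V, D))           # input embeddings (the "300d" layer)
W2 = rng.normal(0, 0.1, (D, V))           # output projection

# (center word, surrounding word) pairs from the two example sentences.
pairs = [("walking", c) for c in ["he", "was", "away", "because"]] + \
        [("running", c) for c in ["he", "was", "away", "because"]]

lr = 0.5
for _ in range(500):
    for center, context in pairs:
        h = W1[idx[center]]               # hidden layer: a lookup, no activation
        scores = h @ W2
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # softmax over the vocabulary
        grad = p.copy()
        grad[idx[context]] -= 1.0         # d(cross-entropy)/d(scores)
        dW1_row = W2 @ grad
        W2 -= lr * np.outer(h, grad)
        W1[idx[center]] -= lr * dW1_row

# Words used in similar contexts end up with similar embeddings:
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(W1[idx["walking"]], W1[idx["running"]]))   # high
print(cos(W1[idx["walking"]], W1[idx["because"]]))   # much lower
```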
How to use these word representations

If we have a vocabulary of 100,000 words:

Classic NLP (100,000-dimensional one-hot vectors):
Walking: [0; 0; 0; 0; …; 0; 0; 1; 0; …; 0; 0]
Running: [0; 0; 0; 0; …; 0; 0; 0; 0; …; 1; 0]
Similarity = 0.0

Goal (300-dimensional learned vectors):
Walking: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
Running: [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
Similarity = 0.9

Transform: x' = x·W1 (the one-hot input x selects a row of the learned 100,000 × 300 matrix W1)
Vector space models of words

While learning these word representations, we are actually building a vector space in which all words reside, with certain relationships between them.

This vector space allows for algebraic operations:

vec(king) – vec(man) + vec(woman) ≈ vec(queen)

It encodes both syntactic and semantic relationships. Why does this linear algebra work?
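One way to try these operations yourself is with the gensim library and the pre-trained Google News vectors; the file path below is a hypothetical local path to the downloaded model.

```python
from gensim.models import KeyedVectors

# Hypothetical path to the downloaded pre-trained Google News vectors.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                       binary=True)

# king - man + woman: 'queen' typically ranks at or near the top.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between distributionally similar words is high.
print(wv.similarity("walking", "running"))
```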
Vector space models of words: semantic relationships
Trained on the Google news corpus with over 300 billion words
Language Sequence
Modeling Tasks
Sequence Modeling: Sequence Label Prediction

Prediction: Sentiment? (positive or negative)

"Ideal for anyone with an interest in disguises" → Sentiment label?
Sequence Modeling: Sequence Prediction

Prediction: Part-of-speech? (noun, verb, …)

"Ideal for anyone with an interest in disguises" → POS? for every word
Sequence Modeling: Sequence Representation

Learning: a sequence representation

"Ideal for anyone with an interest in disguises" → [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
Sequence Modeling: Language Model

Prediction: the next word

"Ideal for anyone with" → next word? → "an interest in disguises"
Application: Speech Recognition

$$\widehat{W} = \arg\max_{W} P(W \mid \text{acoustics}) = \arg\max_{W} \frac{P(\text{acoustics} \mid W)\, P(W)}{P(\text{acoustics})} = \arg\max_{W} P(\text{acoustics} \mid W)\, P(W)$$

where $W$ is the word sequence and $P(W)$ is the language model.
Application: Language Generation

[Diagram: an embedding vector, e.g. [0.1; 0.0004; …; 0.09; 0.05], is decoded by a generation model into text.]

Example: Image captioning
N-Gram Language Model Formulations

▪ Word sequences: $w_1^n = w_1 \ldots w_n$

▪ Chain rule of probability:

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$

▪ Bigram approximation:

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$

▪ N-gram approximation:

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$$
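A maximum-likelihood bigram model is a few lines of Python; this sketch uses a toy corpus and ignores sentence boundaries and smoothing, both of which a real model needs.

```python
from collections import Counter

# Toy corpus; a real model would add boundary tokens and smoothing.
corpus = "the dog on the beach saw the dog in the park".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """P(w | prev), estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(words):
    """P(w_1..w_n) under the bigram approximation."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence("the dog on the beach".split()))  # 0.0625 on this toy corpus
```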
Evaluating Language Models: Perplexity

The best language model is the one that best predicts an unseen test set, i.e., gives the highest $P(\text{sentence})$.

Perplexity is the inverse probability of the test set, normalized by the number of words:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$

Chain rule:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

For bigrams:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
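In practice perplexity is computed in log space, since multiplying many small probabilities underflows; the N-th root formula above is equivalent to exponentiating the average negative log-probability per word. A sketch, reusing the p_bigram function from the earlier snippet:

```python
import math

def perplexity(words, p_cond):
    """exp(-(1/N) * sum_i log p_cond(w_i, w_{i-1})): the log-space
    equivalent of the N-th root formula above."""
    log_p = sum(math.log(p_cond(w, prev)) for prev, w in zip(words, words[1:]))
    n = len(words) - 1  # number of predicted words
    return math.exp(-log_p / n)

# Usage (assumes p_bigram from the earlier sketch is in scope):
# print(perplexity("the dog on the beach".split(), p_bigram))
```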
Challenges in Sequence Modeling

Sequence modeling tasks:
▪ Language model
▪ Sentiment? (positive or negative)
▪ Part-of-speech? (noun, verb, …)
▪ Sequence representation

Main Challenges:
▪ Sequences of variable lengths (e.g., sentences)
▪ Keep the number of parameters at a minimum
▪ Take advantage of possible redundancy
Time-Delay Neural Network

Main Challenges:
▪ Sequences of variable lengths (e.g., sentences)
▪ Keep the number of parameters at a minimum
▪ Take advantage of possible redundancy

Approach: apply a 1D convolution over the sequence "Ideal for anyone with an interest in disguises", reusing the same filter weights at every position.
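A sketch of this idea in Keras: the Conv1D layer slides the same small filter bank over the embedded word sequence, so the parameter count is independent of sentence length (all dimensions below are illustrative assumptions).

```python
from tensorflow import keras

# 1D convolution over word embeddings: the same filters are reused at every
# position, handling variable-length input with a fixed parameter budget.
model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),       # variable-length token ids
    keras.layers.Embedding(input_dim=10_000, output_dim=300),
    keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    keras.layers.GlobalMaxPooling1D(),               # collapse variable length
    keras.layers.Dense(1, activation="sigmoid"),     # e.g., a sentiment label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```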
Neural-based Unigram Language Model (LM)

P(b|a) comes not from counts, but from a neural network that can predict the next word:

P("dog on the beach") = P(dog|START) P(on|dog) P(the|on) P(beach|the)

[Diagram: a separate copy of the same feedforward network makes each prediction:
1-of-N encoding of "START" → P(next word is "dog")
1-of-N encoding of "dog" → P(next word is "on")
1-of-N encoding of "on" → P(next word is "the")
1-of-N encoding of "the" → P(next word is "beach")]
Neural-based Unigram Language Model (LM)

(Same architecture as the previous slide.) Problem: it does not model sequential information between predictions; each network sees only the single preceding word.

Solution: Recurrent Neural Networks!
Recurrent Neural
Networks
Sequence Prediction (or Unigram Language Model)

Input data: $x_1\ x_2\ x_3\ \ldots$ (the $x_i$ are vectors)
Output data: $y_1\ y_2\ y_3\ \ldots$ (the $y_i$ are vectors)

[Diagram: at each time step, input $x_t$ feeds a hidden layer $h_t$ through weights $W_i$, and $h_t$ feeds the output $y_t$ through weights $W_o$; the time steps are processed independently.]

How can we include temporal dynamics?
Elman Network for Sequence Prediction (or Unigram Language Model)

Input data: $x_1\ x_2\ x_3\ \ldots$ (the $x_i$ are vectors)
Output data: $y_1\ y_2\ y_3\ \ldots$ (the $y_i$ are vectors)

[Diagram: as before, but the hidden layer at each time step also receives a copy of the previous hidden activation through recurrent weights $W_h$ (the first step receives 0). The same model parameters $W_i$, $W_h$, $W_o$ are used again and again.]

Can be trained using backpropagation.
Recurrent Neural Network

Start from a feedforward neural network with input $\boldsymbol{x}^{(t)}$, hidden layer $\boldsymbol{h}^{(t)}$ (weights $\boldsymbol{U}$), output $\boldsymbol{z}^{(t)}$ (weights $\boldsymbol{V}$), and loss $L^{(t)}$ against label $y^{(t)}$:

$\boldsymbol{h}^{(t)} = \tanh(\boldsymbol{U}\boldsymbol{x}^{(t)})$
$\boldsymbol{z}^{(t)} = \boldsymbol{V}\boldsymbol{h}^{(t)}$
$L^{(t)} = -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$
Recurrent Neural Networks

Add recurrent weights $\boldsymbol{W}$ from the previous hidden state:

$\boldsymbol{h}^{(t)} = \tanh(\boldsymbol{U}\boldsymbol{x}^{(t)} + \boldsymbol{W}\boldsymbol{h}^{(t-1)})$
$\boldsymbol{z}^{(t)} = \boldsymbol{V}\boldsymbol{h}^{(t)}$
$L^{(t)} = -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$
$L = \sum_t L^{(t)}$
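A numpy sketch of this forward pass (dimensions and data are toy assumptions; a softmax over $\boldsymbol{z}^{(t)}$ supplies $P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 5, 8, 4, 6
U = rng.normal(0, 0.1, (d_h, d_in))     # input weights
W = rng.normal(0, 0.1, (d_h, d_h))      # recurrent weights
V = rng.normal(0, 0.1, (d_out, d_h))    # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, d_in))         # toy input sequence
ys = rng.integers(0, d_out, size=T)     # toy target labels

h, L = np.zeros(d_h), 0.0               # h^(0) = 0
for t in range(T):
    h = np.tanh(U @ xs[t] + W @ h)      # h^(t) = tanh(U x^(t) + W h^(t-1))
    z = V @ h                           # z^(t) = V h^(t)
    L += -np.log(softmax(z)[ys[t]])     # L^(t) = -log P(Y = y^(t) | z^(t))
print(f"total loss L = {L:.3f}")
```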
Recurrent Neural Networks - Unrolling

[Diagram: the network unrolled over time steps $1, 2, 3, \ldots, t$; each step has input $\boldsymbol{x}^{(t)}$, hidden state $\boldsymbol{h}^{(t)}$, output $\boldsymbol{z}^{(t)}$, and loss $L^{(t)}$ against $y^{(t)}$, with $\boldsymbol{W}$ carrying $\boldsymbol{h}^{(t-1)}$ into $\boldsymbol{h}^{(t)}$ and $\boldsymbol{U}$, $\boldsymbol{V}$ shared across steps.]

$\boldsymbol{h}^{(t)} = \tanh(\boldsymbol{U}\boldsymbol{x}^{(t)} + \boldsymbol{W}\boldsymbol{h}^{(t-1)})$
$\boldsymbol{z}^{(t)} = \boldsymbol{V}\boldsymbol{h}^{(t)}$
$L^{(t)} = -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$
$L = \sum_t L^{(t)}$

The same model parameters are used for all time steps.
RNN-based Language Model

[Diagram: a single RNN reads the sentence word by word, its hidden state carried across steps: 1-of-N encoding of "START" → P(next word is "dog"); of "dog" → P(next word is "on"); of "on" → P(next word is "the"); of "the" → P(next word is "beach").]

➢ Models long-term information
RNN-based Sentence Generation (Decoder)

[Diagram: a context vector initializes the RNN, which then generates word by word: 1-of-N encoding of "START" → P(next word is "dog"); of "dog" → P(next word is "on"); of "on" → P(next word is "the"); of "the" → P(next word is "beach").]

➢ Models long-term information
Sequence Modeling: Sequence Label Prediction

Prediction: Sentiment? (positive or negative)

"Ideal for anyone with an interest in disguises" → Sentiment label?
RNN for Sequence Prediction

[Diagram: the RNN reads "Ideal for anyone … disguises" and outputs P(word is positive) at every word.]

$$L = \frac{1}{N}\sum_t L^{(t)} = \frac{1}{N}\sum_t -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$$
RNN for Sequence Prediction

[Diagram: the RNN reads "Ideal for anyone … disguises" and outputs a single P(sequence is positive) at the last step.]

$$L = L^{(N)} = -\log P(Y = y^{(N)} \mid \boldsymbol{z}^{(N)})$$
Sequence Modeling: Sequence Representation

Learning: a sequence representation

"Ideal for anyone with an interest in disguises" → [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
RNN for Sequence Representation

[Diagram: the language-model RNN from before, reading "START", "dog", "on", "the" and predicting the next word at every step.]
RNN for Sequence Representation (Encoder)

[Diagram: the same RNN reads "START", "dog", "on", "the", but the per-step predictions are dropped and the final hidden state is kept as the Sequence Representation.]
RNN-based Machine Translation

[Diagram: an encoder RNN reads the 1-of-N encodings of "le", "chien", "sur", "la", "plage".]

Le chien sur la plage → The dog on the beach
Encoder-Decoder Architecture

[Diagram: the encoder RNN reads "le", "chien", "sur", "la", "plage" into a context vector, which initializes the decoder RNN.]

What is the loss function?
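(The slide leaves this as a question; the standard answer for this architecture is that the decoder is trained as a conditional language model, so the loss is the summed per-word cross-entropy $L = -\sum_t \log P(y^{(t)} \mid y^{(1)}, \ldots, y^{(t-1)}, \text{context})$, i.e., the same negative log-likelihood used for the RNN language model, now conditioned on the encoder's context.)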
Related Topics

▪ Character-level “language models”
▪ Xiang Zhang, Junbo Zhao and Yann LeCun. Character-level Convolutional Networks for Text Classification. NIPS 2015. http://arxiv.org/pdf/1509.01626v2.pdf
▪ Skip-thought: embedding at the sentence level
▪ Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler. Skip-Thought Vectors. NIPS 2015. http://arxiv.org/pdf/1506.06726v1.pdf
Backpropagation
Through Time
Optimization: Gradient Computation

Vector representation: with $y = f(\boldsymbol{h})$ and $\boldsymbol{h} = g(\boldsymbol{x})$,

$$\nabla_{\boldsymbol{x}}\, y = \left(\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \frac{\partial y}{\partial x_3}\right)$$

$$\nabla_{\boldsymbol{x}}\, y = \left(\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}\right)^{T} \nabla_{\boldsymbol{h}}\, y$$

Gradient = "local" Jacobian × "backprop" gradient, where the Jacobian $\partial\boldsymbol{h}/\partial\boldsymbol{x}$ is a matrix of size $h \times x$ computed using partial derivatives.
Backpropagation Algorithm

[Example network: $\boldsymbol{h}_1 = f(\boldsymbol{x}; \boldsymbol{W}_1)$, $\boldsymbol{h}_2 = f(\boldsymbol{h}_1; \boldsymbol{W}_2)$, $\boldsymbol{z} = \boldsymbol{W}_3 \boldsymbol{h}_2$, $L = -\log P(Y = y \mid \boldsymbol{z})$ (cross-entropy).]

Forward pass
▪ Following the graph topology, compute the value of each unit

Backpropagation pass
▪ Initialize the output gradient = 1
▪ Compute the "local" Jacobian matrices using values from the forward pass
▪ Use the chain rule: Gradient = "local" Jacobian × "backprop" gradient
Recurrent Neural Networks

[Diagram: the unrolled network from before, with time steps $1, 2, 3, \ldots, \tau$ and shared parameters $\boldsymbol{U}$, $\boldsymbol{W}$, $\boldsymbol{V}$.]

$\boldsymbol{h}^{(t)} = \tanh(\boldsymbol{U}\boldsymbol{x}^{(t)} + \boldsymbol{W}\boldsymbol{h}^{(t-1)})$
$\boldsymbol{z}^{(t)} = \boldsymbol{V}\boldsymbol{h}^{(t)}$
$L^{(t)} = -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$
$L = \sum_t L^{(t)}$
Backpropagation Through Time

$$L = \sum_t L^{(t)} = -\sum_t \log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$$

Work backwards through the unrolled graph, applying Gradient = "local" Jacobian × "backprop" gradient at every node:

▪ At each loss node: $\dfrac{\partial L}{\partial L^{(t)}} = 1$

▪ At each output (for a softmax output with cross-entropy loss):

$$(\nabla_{\boldsymbol{z}^{(t)}} L)_i = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial z_i^{(t)}} = \mathrm{softmax}(\boldsymbol{z}^{(t)})_i - \mathbf{1}_{i,\,y^{(t)}}$$

▪ At the last hidden state, the gradient comes only from $\boldsymbol{z}^{(\tau)}$:

$$\nabla_{\boldsymbol{h}^{(\tau)}} L = \left(\frac{\partial \boldsymbol{z}^{(\tau)}}{\partial \boldsymbol{h}^{(\tau)}}\right)^{T} \nabla_{\boldsymbol{z}^{(\tau)}} L = \boldsymbol{V}^{T}\, \nabla_{\boldsymbol{z}^{(\tau)}} L$$

▪ At earlier hidden states, gradients flow in both from $\boldsymbol{z}^{(t)}$ and from $\boldsymbol{h}^{(t+1)}$:

$$\nabla_{\boldsymbol{h}^{(t)}} L = \left(\frac{\partial \boldsymbol{z}^{(t)}}{\partial \boldsymbol{h}^{(t)}}\right)^{T} \nabla_{\boldsymbol{z}^{(t)}} L + \left(\frac{\partial \boldsymbol{h}^{(t+1)}}{\partial \boldsymbol{h}^{(t)}}\right)^{T} \nabla_{\boldsymbol{h}^{(t+1)}} L$$
Backpropagation Through Time

$$L = \sum_t L^{(t)} = -\sum_t \log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$$

Once the hidden-state gradients are known, the parameter gradients accumulate contributions from all time steps, since the same $\boldsymbol{U}$, $\boldsymbol{V}$, $\boldsymbol{W}$ are reused everywhere (again Gradient = "local" Jacobian × "backprop" gradient):

$$\nabla_{\boldsymbol{V}} L = \sum_t \nabla_{\boldsymbol{z}^{(t)}} L\; \frac{\partial \boldsymbol{z}^{(t)}}{\partial \boldsymbol{V}}, \qquad \nabla_{\boldsymbol{W}} L = \sum_t \nabla_{\boldsymbol{h}^{(t)}} L\; \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W}}, \qquad \nabla_{\boldsymbol{U}} L = \sum_t \nabla_{\boldsymbol{h}^{(t)}} L\; \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{U}}$$
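Continuing the numpy sketch from the forward pass earlier, here is what BPTT looks like when written out by hand (same toy dimensions; the forward pass stores every hidden state and softmax output so the backward pass can reuse them):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 5, 8, 4, 6
U = rng.normal(0, 0.1, (d_h, d_in))
W = rng.normal(0, 0.1, (d_h, d_h))
V = rng.normal(0, 0.1, (d_out, d_h))
xs = rng.normal(size=(T, d_in))
ys = rng.integers(0, d_out, size=T)

# Forward pass, storing hidden states and softmax outputs.
hs, ps = [np.zeros(d_h)], []
for t in range(T):
    hs.append(np.tanh(U @ xs[t] + W @ hs[-1]))
    z = V @ hs[-1]
    e = np.exp(z - z.max())
    ps.append(e / e.sum())

# Backward pass (BPTT).
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dh_next = np.zeros(d_h)                  # gradient flowing back from h^(t+1)
for t in reversed(range(T)):
    dz = ps[t].copy()
    dz[ys[t]] -= 1.0                     # softmax(z^(t))_i - 1_{i,y^(t)}
    dV += np.outer(dz, hs[t + 1])        # z^(t) = V h^(t)
    dh = V.T @ dz + dh_next              # contributions from z^(t) and h^(t+1)
    da = (1.0 - hs[t + 1] ** 2) * dh     # back through tanh
    dU += np.outer(da, xs[t])
    dW += np.outer(da, hs[t])            # pairs with h^(t-1)
    dh_next = W.T @ da
```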
Gated Recurrent
Neural Networks
RNN for Sequence Prediction

[Diagram: the RNN reads "Ideal for anyone … disguises" and outputs P(word is positive) at every word; the per-step losses are summed.]

$$L = \sum_t L^{(t)} = \sum_t -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$$
RNN for Sequence Prediction

[Diagram: the RNN reads "Ideal for anyone … disguises" and outputs a single P(sequence is positive) at the last step.]

$$L = L^{(N)} = -\log P(Y = y^{(N)} \mid \boldsymbol{z}^{(N)})$$
Recurrent Neural Networks

[Diagram: the unrolled network again, with time steps $1, 2, 3, \ldots, \tau$ and shared parameters $\boldsymbol{U}$, $\boldsymbol{W}$, $\boldsymbol{V}$.]

$\boldsymbol{h}^{(t)} = \tanh(\boldsymbol{U}\boldsymbol{x}^{(t)} + \boldsymbol{W}\boldsymbol{h}^{(t-1)})$
$\boldsymbol{z}^{(t)} = \boldsymbol{V}\boldsymbol{h}^{(t)}$
$L^{(t)} = -\log P(Y = y^{(t)} \mid \boldsymbol{z}^{(t)})$
$L = \sum_t L^{(t)}$
Long-term Dependencies

Vanishing gradient problem for RNNs:
➢ The influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections.

$\boldsymbol{h}^{(t)} \approx \tanh(\boldsymbol{W}\boldsymbol{h}^{(t-1)})$
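A tiny numerical illustration of this (not from the slides): backpropagating through $\boldsymbol{h}^{(t)} \approx \tanh(\boldsymbol{W}\boldsymbol{h}^{(t-1)})$ multiplies the gradient by $\boldsymbol{W}^{T}\mathrm{diag}(1-h^2)$ at every step, so its norm changes exponentially with sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50
for scale in (0.5, 1.0, 2.0):
    W = rng.normal(0, scale / np.sqrt(d), (d, d))
    x = rng.normal(size=d)
    h, hs = np.zeros(d), []
    for _ in range(T):                   # forward: h = tanh(W h + x)
        h = np.tanh(W @ h + x)
        hs.append(h)
    grad = np.ones(d)
    for h in reversed(hs):               # backward through T steps
        grad = W.T @ ((1.0 - h ** 2) * grad)
    # With small W the gradient vanishes; with large W it can explode
    # (or still vanish once tanh saturates).
    print(f"scale {scale}: |grad| after {T} steps = {np.linalg.norm(grad):.2e}")
```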
Recurrent Neural Networks

[Diagram: inside the recurrent unit, $\boldsymbol{h}^{(t)}$ and $\boldsymbol{x}^{(t)}$ feed a tanh nonlinearity (bounded between −1 and +1) that produces $\boldsymbol{h}^{(t+1)}$.]
LSTM Ideas: (1) "Memory" Cell and Self-Loop

[Diagram: the tanh unit now writes into a memory cell $\boldsymbol{c}^{(t)}$ with a self-loop; the cell's content is read out to produce $\boldsymbol{h}^{(t+1)}$.]

Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]
LSTM Ideas: (2) Input and Output Gates

[Diagram: two sigmoid gates (outputs between 0 and +1), each computed from $\boldsymbol{h}^{(t)}$ and $\boldsymbol{x}^{(t)}$: the input gate multiplies what is summed into the cell's self-loop, and the output gate multiplies what is read out of the cell.]

[Hochreiter and Schmidhuber, 1997]
LSTM Ideas: (3) Forget Gate

[Diagram: a third sigmoid gate, the forget gate, multiplies the cell's self-loop so the network can reset its memory.] [Gers et al., 2000]

$$\begin{pmatrix} \boldsymbol{g} \\ \boldsymbol{i} \\ \boldsymbol{f} \\ \boldsymbol{o} \end{pmatrix} = \begin{pmatrix} \tanh \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \end{pmatrix} \boldsymbol{W} \begin{pmatrix} \boldsymbol{h}^{(t-1)} \\ \boldsymbol{x}^{(t)} \end{pmatrix}$$

$$\boldsymbol{c}^{(t)} = \boldsymbol{f} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{i} \odot \boldsymbol{g}$$

$$\boldsymbol{h}^{(t)} = \boldsymbol{o} \odot \tanh(\boldsymbol{c}^{(t)})$$
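A direct numpy transcription of these three equations, as a sketch (biases are omitted for brevity, though real implementations include them; dimensions are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, Wmat):
    """One LSTM step: gates from [h^(t-1); x^(t)], then the cell update."""
    d = h_prev.shape[0]
    pre = Wmat @ np.concatenate([h_prev, x_t])  # one matmul for all four blocks
    g = np.tanh(pre[0 * d:1 * d])               # candidate values
    i = sigmoid(pre[1 * d:2 * d])               # input gate
    f = sigmoid(pre[2 * d:3 * d])               # forget gate
    o = sigmoid(pre[3 * d:4 * d])               # output gate
    c_t = f * c_prev + i * g                    # c^(t) = f ⊙ c^(t-1) + i ⊙ g
    h_t = o * np.tanh(c_t)                      # h^(t) = o ⊙ tanh(c^(t))
    return h_t, c_t

# Usage on a toy sequence:
rng = np.random.default_rng(0)
d_h, d_in, T = 8, 5, 6
Wmat = rng.normal(0, 0.1, (4 * d_h, d_h + d_in))
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(T, d_in)):
    h, c = lstm_step(x_t, h, c, Wmat)
print(h)
```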
Recurrent Neural Network using LSTM Units

[Diagram: the unrolled network with each hidden unit replaced by an LSTM unit, LSTM(1), LSTM(2), LSTM(3), …, LSTM(τ), sharing recurrent parameters $\boldsymbol{W}$ and producing outputs $\boldsymbol{z}^{(t)}$ through $\boldsymbol{V}$ with losses $L^{(t)}$ against $y^{(t)}$.]

The gradient can still be computed using backpropagation!
Bi-directional LSTM Network

[Diagram: two LSTM chains over the same inputs $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(\tau)}$: one running forward in time (recurrent parameters $\boldsymbol{W}_1$) and one running backward (recurrent parameters $\boldsymbol{W}_2$); at each time step, both hidden states feed the output $\boldsymbol{z}^{(t)}$ through $\boldsymbol{V}$.]
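In Keras this is a one-line wrapper; a sketch of a bi-directional LSTM tagger with illustrative dimensions:

```python
from tensorflow import keras

# Bidirectional runs one LSTM left-to-right and a second right-to-left,
# concatenating their hidden states at every time step.
model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),
    keras.layers.Embedding(input_dim=10_000, output_dim=300),
    keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True)),
    keras.layers.Dense(4, activation="softmax"),  # e.g., a label per word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```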
Deep LSTM Network

[Diagram: two stacked LSTM layers: the first layer (recurrent parameters $\boldsymbol{W}_1$) reads the inputs $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(\tau)}$, its hidden states are the inputs to the second layer (recurrent parameters $\boldsymbol{W}_2$), and the top layer's states produce the outputs $\boldsymbol{z}^{(t)}$ through $\boldsymbol{V}$.]