DEEP UNSUPERVISED LEARNING
Transcript of DEEP UNSUPERVISED LEARNING
![Page 1: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/1.jpg)
Using RNNs to generate Super Mario Maker levels, Adam Geitgey
COMP547 DEEP UNSUPERVISED LEARNING
Aykut Erdem // Koç University // Spring 2022
Lecture #3 – Neural Networks Basics II: Sequential Processing with NNs
![Page 2: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/2.jpg)
• deep learning
• computation in a neural net
• optimization
• backpropagation
• training tricks
• convolutional neural networks
Previously on COMP547
Loss landscape created with data from the training process of a convolutional network, Javier Ideami
![Page 3: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/3.jpg)
Good news, everyone!
• Paper list for the paper presentations is out! Each graduate student should select:
– a paper to provide an overview of,
– another paper to present either its strengths or weaknesses.
• Undergraduate students will only submit paper reviews.
3
![Page 4: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/4.jpg)
Lecture overview
• sequence modeling
• recurrent neural networks (RNNs)
• language modeling with RNNs
• how to train RNNs
• long short-term memory (LSTM)
• gated recurrent unit (GRU)
• Disclaimer: Much of the material and slides for this lecture were borrowed from
– Bill Freeman, Antonio Torralba and Phillip Isola's MIT 6.869 class
– Phil Blunsom's Oxford Deep NLP class
– Fei-Fei Li, Andrej Karpathy and Justin Johnson's CS231n class
– Arun Mallya's tutorial on Recurrent Neural Networks
4
![Page 5: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/5.jpg)
Sequential data
• sentence: "I took the dog for a walk this morning."
• medical signals
• speech waveform
• video frames
5
Sequences in Vision
Sequences in the input: Jumping, Dancing, Fighting, Eating, Running
Adapted from Harini Suresh
![Page 6: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/6.jpg)
Modeling sequential data
• Sample data sequences from a certain distribution: P(x1, ..., xN)
• Generate natural sentences to describe an image: P(y1, ..., yM | I)
• Activity recognition from a video sequence: P(y | x1, ..., xN)
6
Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 4 / 48
![Page 7: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/7.jpg)
Modeling sequential data
• Speech recognition: P(y1, ..., yN | x1, ..., xN)
• Object tracking: P(y1, ..., yN | x1, ..., xN)
7
speech waveform → "Hey Siri"
Adapted from Xiaogang Wang
![Page 8: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/8.jpg)
Modeling sequential data
• Generate natural sentences to describe a video: P(y1, ..., yM | x1, ..., xN)
• Machine translation: P(y1, ..., yM | x1, ..., xN)
8
Video description: a sequence-to-sequence problem
Input: video frames … → Deep Neural Network → Output: "A man is riding a bike."
Adapted from Xiaogang Wang
![Page 9: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/9.jpg)
Convolutions in time
time
9
![Page 11: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/11.jpg)
time
Rufus
11
![Page 12: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/12.jpg)
time
Douglas
12
![Page 13: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/13.jpg)
time
Rufus
Memory unit
13
![Page 14: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/14.jpg)
time
Rufus
Memory unit
Rufus!
14
![Page 15: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/15.jpg)
Recurrent Neural Networks (RNNs)
Hidden
Outputs
Inputs
15
![Page 16: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/16.jpg)
To model sequences, we need
1. to deal with variable length sequences2. to maintain sequence order3. to keep track of long-term dependencies 4. to share parameters across the sequence
16
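The four requirements above can be illustrated with a minimal NumPy sketch: one set of weights (the names W_xh, W_hh and the toy sizes are illustrative, not trained parameters) is reused at every position, the hidden state carries order and longer-term information, and the same loop handles any sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                      # hidden size, input size (toy values)
W_xh = rng.normal(size=(H, D))   # input-to-hidden weights, shared across time
W_hh = rng.normal(size=(H, H))   # hidden-to-hidden weights, shared across time
b = np.zeros(H)

def rnn_forward(xs):
    """Run the same cell over a sequence of any length, in order."""
    h = np.zeros(H)              # initial state
    for x in xs:                 # one step per element: order is preserved
        h = np.tanh(W_xh @ x + W_hh @ h + b)
    return h                     # final state summarizes the whole sequence

short = [rng.normal(size=D) for _ in range(2)]
long = [rng.normal(size=D) for _ in range(10)]
print(rnn_forward(short).shape, rnn_forward(long).shape)  # same parameters, both lengths
```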
![Page 17: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/17.jpg)
Recurrent Neural Networks
17
![Page 18: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/18.jpg)
Hidden
Outputs
Inputs
time
Recurrent Neural Networks (RNNs)
18
![Page 19: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/19.jpg)
time
Hidden
Outputs
Inputs
Recurrent Neural Networks (RNNs)
19
![Page 20: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/20.jpg)
Hidden
Outputs
Inputs
Recurrent!
Recurrent Neural Networks (RNNs)
20
![Page 21: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/21.jpg)
time
Hidden
Outputs
Inputs
Recurrent Neural Networks (RNNs)
21
![Page 22: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/22.jpg)
time
Hidden
Outputs
Inputs
…
Deep Recurrent Neural Networks (RNNs)
22
![Page 23: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/23.jpg)
Language Modeling
23
![Page 24: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/24.jpg)
Language modeling
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
24
• Language models aim to represent the history of observed text (w1,...,wt-1) succinctly in order to predict the next word (wt):
possible task: language model (trained on all the works of Shakespeare)
![Page 25: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/25.jpg)
RNN language models
25
h_n = g(V[x_n; h_{n−1}] + c)
y_n = W h_n + b
a probability distribution over possible next words, aka a softmax
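The two equations can be sketched as one step of an RNN language model in NumPy; all sizes and weight values here are illustrative toy choices, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
V_size, D, H = 10, 5, 8                  # toy vocabulary, embedding, hidden sizes
V = rng.normal(size=(H, D + H)) * 0.1    # acts on the concatenation [x_n; h_{n-1}]
c = np.zeros(H)
W = rng.normal(size=(V_size, H)) * 0.1
b = np.zeros(V_size)

def lm_step(x, h_prev):
    """One step: h_n = g(V[x_n; h_{n-1}] + c), p_n = softmax(W h_n + b)."""
    h = np.tanh(V @ np.concatenate([x, h_prev]) + c)
    logits = W @ h + b
    p = np.exp(logits - logits.max())    # numerically stable softmax
    p /= p.sum()
    return h, p                          # p is a distribution over the next word

h0 = np.zeros(H)
x0 = rng.normal(size=D)                  # embedding of the current word
h1, p1 = lm_step(x0, h0)
print(p1.sum())                          # ≈ 1.0: a valid next-word distribution
```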
![Page 26: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/26.jpg)
RNN language models
26
![Page 27: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/27.jpg)
RNN language models
27
![Page 28: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/28.jpg)
RNN language models
28
[Figure: at each step the RNN reads word w_{n−1} (<s>, "There", "he", "built"), updates hidden state h_n (h1 ... h4), and outputs a softmax distribution p_n (p1 ... p4) over the vocabulary ("the", "it", "if", "was", ..., "aardvark") for the next word.]
Our dictionary also includes an EOS token to decide when to stop
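That generation loop, including the EOS stopping rule, can be sketched as follows; the tiny vocabulary and random weights are illustrative stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["<s>", "</s>", "There", "he", "built", "a", "dog"]
EOS = vocab.index("</s>")
H = 6
# hypothetical toy parameters; a real model would learn these
W_e = rng.normal(size=(H, len(vocab))) * 0.5   # one embedding column per word
V = rng.normal(size=(H, 2 * H)) * 0.5
W_o = rng.normal(size=(len(vocab), H)) * 0.5

def step(w, h):
    h = np.tanh(V @ np.concatenate([W_e[:, w], h]))
    logits = W_o @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return h, p

def generate(max_len=20):
    h, w, out = np.zeros(H), vocab.index("<s>"), []
    for _ in range(max_len):
        h, p = step(w, h)
        w = int(rng.choice(len(vocab), p=p))   # sample the next word
        if w == EOS:                           # the EOS token ends generation
            break
        out.append(vocab[w])
    return out

print(generate())
```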
![Page 29: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/29.jpg)
Beam Search (K = 3)
For t = 1 ... T:
• For all k and for all possible output words w:
  s(w, y(k)_{1:t−1}) = log p(y(k)_{1:t−1} | x) + log p(w | y(k)_{1:t−1}, x)
• Update beam:
  y(1:K)_{1:t} ← K-argmax s(w, y(k)_{1:t−1})
29
[Beam tree over candidate prefixes starting with "a", "the", "red"]
Slide credit: Alexander Rush
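The beam update can be sketched directly from the two lines above. Here next_logprobs is a hypothetical, deterministic stand-in for the model's log p(w | y_{1:t−1}, x); only the beam bookkeeping is the point.

```python
import math

VOCAB = ["a", "the", "red", "dog", "</s>"]

def next_logprobs(prefix):
    """Hypothetical stand-in for log p(w | y_{1:t-1}, x) from a trained model."""
    t = len(prefix)
    # toy deterministic scores so the example is self-contained
    scores = [1.0 / (1 + abs(len(w) - (t % 4))) for w in VOCAB]
    z = sum(scores)
    return {w: math.log(s / z) for w, s in zip(VOCAB, scores)}

def beam_search(K=3, T=4):
    beam = [([], 0.0)]                     # (prefix y_{1:t-1}, log p so far)
    for _ in range(T):
        candidates = []
        for prefix, lp in beam:            # for all k ...
            for w, lw in next_logprobs(prefix).items():   # ... and all words w
                candidates.append((prefix + [w], lp + lw))  # s(w, y_{1:t-1})
        # update beam: keep the K-argmax of the scores
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beam

for prefix, lp in beam_search():
    print(prefix, round(lp, 3))
```

Note that each candidate's score is the parent's accumulated log-probability plus the new word's log-probability, exactly the additive form on the slide.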
![Page 30: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/30.jpg)
Beam Search (K = 3)
30
![Page 31: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/31.jpg)
Beam Search (K = 3)
31
![Page 32: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/32.jpg)
Beam Search (K = 3)
32
![Page 33: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/33.jpg)
Beam Search (K = 3)
33
![Page 34: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/34.jpg)
Beam Search (K = 3)
34
![Page 35: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/35.jpg)
35
at first:
train more
train more
train more
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
![Page 36: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/36.jpg)
36http://karpathy.github.io/2015/05/21/rnn-effectiveness/
![Page 38: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/38.jpg)
More Language Modeling Fun – Generating Super Mario Levels
38
https://medium.com/@ageitgey/machine-learning-is-fun-part-2-a26a10b68df3
Original Level:
Textual Representation:
A level generated by an RNN:
![Page 39: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/39.jpg)
• Consider the problem of translation of English to French
• E.g. "What is your name" → "Comment tu t'appelles"
• Is the below architecture suitable for this problem?
• No: sentences might be of different lengths, and words might not align. We need to see the entire sentence before translating
Is this enough?
39
E1 E2 E3
F1 F2 F3
Adapted from http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
![Page 40: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/40.jpg)
Encoder-decoder seq2seq model• Consider the problem of translation of English to French
• E.g. "What is your name" → "Comment tu t'appelles"
• Sentences might be of different lengths, and words might not align. We need to see the entire sentence before translating
• Input-Output nature depends on the structure of the problem at hand
40
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NIPS 2014
[Figure: the encoder reads E1 E2 E3, then the decoder emits F1 F2 F3 F4]
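The encode-then-decode idea can be sketched with two toy RNNs: the encoder compresses the whole source into one state, and the decoder then unrolls for an independently chosen number of steps, so input and output lengths need not match. All weights and sizes here are illustrative, untrained values.

```python
import numpy as np

rng = np.random.default_rng(3)
H, D = 8, 4

def rnn(xs, h, Wx, Wh):
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

# hypothetical toy weights for encoder and decoder
We_x, We_h = rng.normal(size=(H, D)) * 0.3, rng.normal(size=(H, H)) * 0.3
Wd_x, Wd_h = rng.normal(size=(H, D)) * 0.3, rng.normal(size=(H, H)) * 0.3
W_out = rng.normal(size=(D, H)) * 0.3      # emits the next target "word" vector

def translate(source, target_len):
    """Encode the whole source first, then decode: lengths need not match."""
    h = rnn(source, np.zeros(H), We_x, We_h)   # summary of the full sentence
    y, out = np.zeros(D), []
    for _ in range(target_len):                # output length is independent
        h = np.tanh(Wd_x @ y + Wd_h @ h)
        y = W_out @ h
        out.append(y)
    return out

source = [rng.normal(size=D) for _ in range(3)]    # 3 source words in ...
print(len(translate(source, 4)))                    # → 4 target words out
```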
![Page 41: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/41.jpg)
Recurrent Networks offer a lot of flexibility:
41
Vanilla Neural Networks
![Page 42: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/42.jpg)
Recurrent Networks offer a lot of flexibility:
42
e.g. Image Captioning: image → sequence of words
![Page 43: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/43.jpg)
Recurrent Networks offer a lot of flexibility:
43
e.g. Sentiment Classification: sequence of words → sentiment
![Page 44: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/44.jpg)
Recurrent Networks offer a lot of flexibility:
44
e.g. Machine Translation: sequence of words → sequence of words
![Page 45: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/45.jpg)
Recurrent Networks offer a lot of flexibility:
45
e.g. Video classification on frame level
![Page 46: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/46.jpg)
Multi-layer RNNs
• We can of course design RNNs with multiple hidden layers
• Think exotic: skip connections across layers, across time, …
46
x1 x2 x3 x4 x5 x6
y1 y2 y3 y4 y5 y6
![Page 47: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/47.jpg)
Bi-directional RNNs
• RNNs can process the input sequence in the forward and in the reverse direction
• Popular in speech recognition and machine translation
47
x1 x2 x3 x4 x5 x6
y1 y2 y3 y4 y5 y6
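A bi-directional RNN can be sketched as two independent passes whose states are concatenated per position, so each output sees both left and right context. The weight names and sizes are illustrative toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
H, D = 5, 3
# separate toy weights for the forward and backward passes
Wf_x, Wf_h = rng.normal(size=(H, D)) * 0.3, rng.normal(size=(H, H)) * 0.3
Wb_x, Wb_h = rng.normal(size=(H, D)) * 0.3, rng.normal(size=(H, H)) * 0.3

def run(xs, Wx, Wh):
    h, states = np.zeros(H), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

def birnn(xs):
    """Each position sees both its left context and its right context."""
    fwd = run(xs, Wf_x, Wf_h)                  # left-to-right pass
    bwd = run(xs[::-1], Wb_x, Wb_h)[::-1]      # right-to-left pass, realigned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

xs = [rng.normal(size=D) for _ in range(6)]
print(len(birnn(xs)), birnn(xs)[0].shape)      # 6 positions, 2H features each
```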
![Page 48: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/48.jpg)
How to Train Recurrent Neural Networks
48
![Page 49: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/49.jpg)
BackPropagation Refresher

y = f(x; W)
C = Loss(y, y_GT)

SGD Update:
W ← W − η ∂C/∂W

∂C/∂W = (∂C/∂y)(∂y/∂W)
49

[Computation graph: x → f(x; W) → y → C]
Slide credit: Arun Mallya
![Page 50: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/50.jpg)
Multiple Layers

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

SGD Update:
W2 ← W2 − η ∂C/∂W2
W1 ← W1 − η ∂C/∂W1
50

[Computation graph: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → C]
Slide credit: Arun Mallya
![Page 51: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/51.jpg)
Chain Rule for Gradient Computation

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

Find ∂C/∂W1, ∂C/∂W2

Application of the Chain Rule:
∂C/∂W2 = (∂C/∂y2)(∂y2/∂W2)
∂C/∂W1 = (∂C/∂y1)(∂y1/∂W1) = (∂C/∂y2)(∂y2/∂y1)(∂y1/∂W1)
51

[Computation graph: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → C]
Slide credit: Arun Mallya
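As a sanity check, the chain-rule gradient above can be verified numerically with scalar layers. The concrete choices of f1, f2 and the constants are toy values made up for this sketch.

```python
import numpy as np

# two scalar "layers" and a loss, mirroring y1 = f1(x; W1), y2 = f2(y1; W2)
def f1(x, W1): return np.tanh(W1 * x)
def f2(y1, W2): return W2 * y1
def loss(y2, y_gt): return 0.5 * (y2 - y_gt) ** 2

x, W1, W2, y_gt = 0.7, 0.5, -1.3, 0.2

# chain rule: dC/dW1 = (dC/dy2)(dy2/dy1)(dy1/dW1)
y1 = f1(x, W1)
y2 = f2(y1, W2)
dC_dy2 = y2 - y_gt
dy2_dy1 = W2
dy1_dW1 = (1 - np.tanh(W1 * x) ** 2) * x
analytic = dC_dy2 * dy2_dy1 * dy1_dW1

# numerical check by central finite differences
eps = 1e-6
numeric = (loss(f2(f1(x, W1 + eps), W2), y_gt)
           - loss(f2(f1(x, W1 - eps), W2), y_gt)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # → True: the two gradients agree
```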
![Page 52: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/52.jpg)
Backprop through time
time
Hidden
Outputs
Inputs
52
![Page 53: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/53.jpg)
time
Hidden
Outputs
Inputs
53
![Page 54: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/54.jpg)
Recurrent linear layer
54
![Page 55: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/55.jpg)
55
We have a loss at each timestep (since we're making a prediction at each timestep):
[Figure: unrolled RNN; inputs x_t enter through W, hidden states s_t connect through U, and each timestep emits a prediction ŷ_t with loss J_t]
![Page 56: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/56.jpg)
56
![Page 57: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/57.jpg)
57
We sum the losses across time:
Loss at time t: J_t(Θ)
Total loss: J(Θ) = Σ_t J_t(Θ)
![Page 58: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/58.jpg)
58
Let's try it out for W with the chain rule:
∂J/∂W = Σ_t ∂J_t/∂W
![Page 59: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/59.jpg)
59
∂J/∂W = Σ_t ∂J_t/∂W
so let's take a single timestep t:
![Page 60: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/60.jpg)
60
∂J_2/∂W = (∂J_2/∂y_2)(∂y_2/∂s_2)(∂s_2/∂W)
![Page 61: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/61.jpg)
63
![Page 62: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/62.jpg)
64
but wait…
![Page 63: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/63.jpg)
65
but wait…
s_2 = tanh(U s_1 + W x_2)
![Page 64: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/64.jpg)
66
s_2 = tanh(U s_1 + W x_2)
but s_1 also depends on W, so we can't just treat ∂s_2/∂W as a constant!
![Page 65: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/65.jpg)
67
![Page 66: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/66.jpg)
68
∂s_2/∂W
![Page 67: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/67.jpg)
69
∂s_2/∂W + (∂s_2/∂s_1)(∂s_1/∂W)
![Page 68: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/68.jpg)
70
∂s_2/∂W + (∂s_2/∂s_1)(∂s_1/∂W) + (∂s_2/∂s_0)(∂s_0/∂W)
![Page 69: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/69.jpg)
Backpropagation through time:
71
∂J_2/∂W = Σ_{k=0}^{2} (∂J_2/∂y_2)(∂y_2/∂s_2)(∂s_2/∂s_k)(∂s_k/∂W)
Contributions of W in previous timesteps to the error at timestep t
![Page 70: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/70.jpg)
Backpropagation through time:
72
∂J_t/∂W = Σ_{k=0}^{t} (∂J_t/∂y_t)(∂y_t/∂s_t)(∂s_t/∂s_k)(∂s_k/∂W)
Contributions of W in previous timesteps to the error at timestep t
![Page 71: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/71.jpg)
Why are RNNs hard to train?
73
![Page 72: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/72.jpg)
Vanishing Gradient Problem
74
∂J_2/∂W = Σ_{k=0}^{2} (∂J_2/∂y_2)(∂y_2/∂s_2)(∂s_2/∂s_k)(∂s_k/∂W)
![Page 73: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/73.jpg)
Vanishing Gradient Problem
75
∂J_2/∂W = Σ_{k=0}^{2} (∂J_2/∂y_2)(∂y_2/∂s_2)(∂s_2/∂s_k)(∂s_k/∂W)
at k = 0:
∂s_2/∂s_0 = (∂s_2/∂s_1)(∂s_1/∂s_0)
![Page 74: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/74.jpg)
Vanishing Gradient Problem
76
∂J_n/∂W = Σ_{k=0}^{n} (∂J_n/∂y_n)(∂y_n/∂s_n)(∂s_n/∂s_k)(∂s_k/∂W)
(∂s_n/∂s_{n−1})(∂s_{n−1}/∂s_{n−2}) … (∂s_3/∂s_2)(∂s_2/∂s_1)(∂s_1/∂s_0)
as the gap between timesteps gets bigger, this product gets longer and longer!
![Page 75: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/75.jpg)
Vanishing Gradient Problem
77
(∂s_n/∂s_{n−1})(∂s_{n−1}/∂s_{n−2}) … (∂s_3/∂s_2)(∂s_2/∂s_1)(∂s_1/∂s_0)
what are each of these terms?
∂s_j/∂s_{j−1} = U^T diag[f′(U s_{j−1} + W x_j)]
U sampled from a standard normal distribution → entries mostly < 1
f = tanh or sigmoid, so f′ < 1
we're multiplying a lot of small numbers together.
![Page 76: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/76.jpg)
Vanishing Gradient Problem
78
we're multiplying a lot of small numbers together.
so what? errors due to further-back timesteps have increasingly smaller gradients.
so what? parameters become biased to capture shorter-term dependencies.
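The shrinking product can be seen numerically: multiply the one-step Jacobians diag(f′)·U together and watch the norm collapse. The weight scale here is an illustrative choice small enough to make the effect obvious.

```python
import numpy as np

rng = np.random.default_rng(6)
H = 10
U = rng.normal(size=(H, H)) * 0.1   # small recurrent weights (illustrative)

# track the norm of the product ds_n/ds_{n-1} ... ds_1/ds_0 from the slide;
# each factor is diag(f'(.)) @ U with f = tanh, so f' <= 1
s = rng.normal(size=H)
g = np.eye(H)
norms = []
for _ in range(50):
    s = np.tanh(U @ s)
    jac = (1 - s ** 2)[:, None] * U   # one-step Jacobian ds_j/ds_{j-1}
    g = jac @ g                       # the product grows one factor longer
    norms.append(np.linalg.norm(g))

print(norms[0], norms[-1])            # the norm collapses toward zero
```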
![Page 77: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/77.jpg)
A Toy Example
• 2 categories of sequences
• Can the single tanh unit learn to store for T time steps 1 bit of information given by the sign of the initial input?
(Simple experiments from 1991, while at MIT)
79
[Figure: Prob(success | seq. length T)]
Slide credit: Yoshua Bengio
![Page 78: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/78.jpg)
80
Vanishing Gradient Problem
"In France, I had a great time and I learnt some of the _____ language."
our parameters are not trained to capture long-term dependencies, so the word we predict will mostly depend on the previous few words, not much earlier ones
![Page 79: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/79.jpg)
Long-Term Dependencies
• The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al. 1994].
• Problems:
  • sing. values of Jacobians > 1 → gradients explode
  • or sing. values < 1 → gradients shrink & vanish
  • or random → variance grows exponentially
• Storing bits robustly requires sing. values < 1
81
Slide credit: Yoshua Bengio
![Page 80: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/80.jpg)
RNN Tricks
• Mini-batch creation strategies (efficient computations)
• Clipping gradients (avoid exploding gradients)
• Leaky integration (propagate long-term dependencies)
• Momentum (cheap 2nd order)
• Dropout (avoid overfitting)
• Initialization (start in right ballpark avoids exploding/vanishing)
• Sparse Gradients (symmetry breaking)
• Gradient propagation regularizer (avoid vanishing gradient)
• Gated self-loops (LSTM & GRU, reduces vanishing gradient)
82
Slide adapted from Yoshua Bengio
(Pascanu et al., 2013; Bengio et al., 2013; Gal and Ghahramani, 2016;Morishita et al., 2017)
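The gradient-clipping trick from the list can be sketched as global-norm rescaling: if the combined norm of all parameter gradients exceeds a threshold, shrink them all by the same factor. The function name and threshold are illustrative.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale the whole gradient if its global norm exceeds max_norm
    (the 'clipping gradients' trick against exploding gradients)."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

big = [np.full((3, 3), 10.0), np.full(3, 10.0)]   # global norm ~34.6
clipped = clip_gradients(big)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ≈ 5.0, the clipped norm
```

Rescaling the whole gradient (rather than clipping each entry) preserves its direction, only shortening the update step.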
![Page 81: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/81.jpg)
Mini-batching in RNNs
• Mini-batching makes things much faster!
• But mini-batching in RNNs is harder than in feed-forward networks:
− Each word depends on the previous word
− Sequences are of various lengths
• Padding:
− If we use sentences of different lengths, too much padding and sorting can result in decreased performance
− To remedy this: sort sentences so that similar-length sequences end up in the same batch
83 Slide adapted from Graham Neubig
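The sort-then-pad recipe above can be sketched as follows; the function name, the list-of-token-lists input, and the `</s>` pad token are illustrative choices, not from a particular toolkit:

```python
import random

def make_batches(corpus, batch_size, pad_token="</s>"):
    """corpus: list of token lists. Returns shuffled mini-batches."""
    # Sort so that similar-length sequences land in the same mini-batch
    corpus = sorted(corpus, key=len)
    batches = []
    for i in range(0, len(corpus), batch_size):
        batch = corpus[i:i + batch_size]
        longest = max(len(s) for s in batch)
        # Pad each sequence with end-of-sentence tokens up to the longest one
        batches.append([s + [pad_token] * (longest - len(s)) for s in batch])
    random.shuffle(batches)  # shuffle the order of the mini-batches, not the sentences
    return batches
```

Sorting first keeps the padding per batch small; shuffling afterwards restores some randomness across batches.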
![Page 82: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/82.jpg)
Mini-batching in RNNs
• Many alternatives:
1. Shuffle the corpus randomly before creating mini-batches (with no sorting).
2. Sort based on the source sequence length.
3. Sort based on the target sequence length.
4. Sort using the source sequence length, break ties by sorting by target sequence length.
5. Sort using the target sequence length, break ties by sorting by source sequence length.
84 M. Morishita, Y. Oda, G. Neubig, K. Yoshino, K. Sudoh, and S. Nakamura. "An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation". 1st Workshop on NMT 2017
Figure 1: An example of mini-batching in an encoder-decoder translation model.
…example of mini-batching two sentences of different lengths in an encoder-decoder model.
The first thing that we can notice from the figure is that multiple operations at a particular time step t can be combined into a single operation. For example, both “John” and “I” are embedded in a single step into a matrix that is passed into the encoder LSTM in a single step. On the target side as well, we calculate the loss for the target words at time step t for every sentence in the mini-batch simultaneously.
However, there are problems when sentences are of different lengths, as only some sentences will have any content at a particular time step. To resolve this problem, we pad short sentences with end-of-sentence tokens to adjust their length to the length of the longest sentence. In Figure 1, the purple-colored “⟨/s⟩” indicates the padded end-of-sentence token.
Padding with these tokens makes it possible to handle variable-length sentences as if they were of the same length. On the other hand, the computational cost for a mini-batch increases in proportion to the longest sentence therein, and excess padding can result in a significant amount of wasted computation. One way to fix this problem is by creating mini-batches that include sentences of similar length (Sutskever et al., 2014)
Algorithm 1 Create mini-batches
1: C ← training corpus
2: C ← sort(C) or shuffle(C)  ▷ sort or shuffle the whole corpus
3: B ← {}  ▷ mini-batches
4: i ← 0, j ← 0
5: while i < C.size() do
6:   B[j] ← B[j] + C[i]
7:   if B[j].size() ≥ max_mini_batch_size then
8:     B[j] ← padding(B[j])  ▷ pad tokens to the longest sentence in the mini-batch
9:     j ← j + 1
10:  end if
11:  i ← i + 1
12: end while
13: B ← shuffle(B)  ▷ shuffle the order of the mini-batches
to reduce the amount of padding required. Many NMT toolkits implement length-based sorting of the training corpus for this purpose. In the following section, we discuss several different mini-batch creation strategies used in existing neural MT toolkits.
![Page 83: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/83.jpg)
Mini-batching in RNNs
• May affect performance!
Figure 2: Training curves on the ASPEC-JE test set. The y- and x-axes show the negative log likelihoods and number of processed sentences. The scale of the x-axis in method (f) is different from the others.
Figure 3: Training curves on the WMT2016 test set. Axes are the same as in Figure 2.
…accuracy of the model. We hypothesize that the large variance of the loss affects the final model accuracy, especially when using a learning algorithm that uses momentum, such as Adam. However, these results indicate that these differences do not significantly affect the training results. We leave a comparison of memory consumption for future research.
4.2.3 Effect of Corpus Sorting Method using Adam
From all experimental results of methods (a), (b), (c), (d) and (e), in the case of using SHUFFLE or SRC, perplexities drop faster and tend to converge to lower perplexities than the other methods for all mini-batch sizes. We believe the main reason for this is the similarity of the sentences contained in each mini-batch. If the sentence length is similar, the features of the sentence may also be similar. We carefully examined the corpus and found that at least this is true for the corpus we used (e.g. shorter sentences tend to contain similar words). In this case, if we sort sentences by their length, sentences that have similar features will be gathered into the same mini-batch, making training less stable than if all sentences…
![Page 84: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/84.jpg)
Gradient Norm Clipping
86
Dealing with Gradient Explosion by Gradient Norm Clipping
(Mikolov thesis 2012; Pascanu, Mikolov, Bengio, ICML 2013)
[Figure: error surface over parameters θ, comparing a gradient step with and without norm clipping]
Recurrent neural network regularization. Zaremba et al., arXiv 2014.
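Clipping by the global gradient norm can be sketched in a few lines; the function name and the threshold of 5.0 are illustrative choices, not prescribed by the papers cited above:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their joint L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm   # shrink while preserving the direction
        grads = [g * scale for g in grads]
    return grads
```

Because all gradients are scaled by the same factor, the update direction is unchanged; only the step size on exploding gradients is bounded.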
![Page 85: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/85.jpg)
Regularization: Dropout
• Large recurrent networks often overfit their training data by memorizing the sequences observed. Such models generalize poorly to novel sequences.
• A common approach in Deep Learning is to overparametrize a model, such that it could easily memorize the training data, and then heavily regularize it to facilitate generalization.
• The regularization method of choice is often Dropout.
87Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Srivastava et al. JMLR 2014.
![Page 86: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/86.jpg)
Regularization: Dropout
• Dropout is ineffective when applied to recurrent connections, as repeated random masks zero all hidden units in the limit.
• The most common solution is to only apply dropout to non-recurrent connections.
88Recurrent neural network regularization. Zaremba et al., arXiv 2014.
[Figure: an unrolled RNN (inputs w0…w4, hidden states h1…h4, predictions p1…p4, per-step costs) in which dropout is applied only to the non-recurrent input and output connections, never to the hidden-to-hidden connections]
![Page 87: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/87.jpg)
Regularization: Dropout
• A better solution: use the same dropout mask at each time step for inputs, outputs, and recurrent layers.
89A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. Gal and Ghahramani. NIPS 2016
(a) Naive dropout RNN  (b) Variational RNN
Figure 1: Depiction of the dropout technique following our Bayesian interpretation (right) compared to the standard technique in the field (left). Each square represents an RNN unit, with horizontal arrows representing time dependence (recurrent connections). Vertical arrows represent the input and output to each RNN unit. Coloured connections represent dropped-out inputs, with different colours corresponding to different dropout masks. Dashed lines correspond to standard connections with no dropout. Current techniques (naive dropout, left) use different masks at different time steps, with no dropout on the recurrent layers. The proposed technique (Variational RNN, right) uses the same dropout mask at each time step, including the recurrent layers.
In the new dropout variant, we repeat the same dropout mask at each time step for both inputs, outputs, and recurrent layers (drop the same network units at each time step). This is in contrast to the existing ad hoc techniques where different dropout masks are sampled at each time step for the inputs and outputs alone (no dropout is used with the recurrent connections since the use of different masks with these connections leads to deteriorated performance). Our method and its relation to existing techniques is depicted in figure 1. When used with discrete inputs (i.e. words) we place a distribution over the word embeddings as well. Dropout in the word-based model corresponds then to randomly dropping word types in the sentence, and might be interpreted as forcing the model not to rely on single words for its task.
We next survey related literature and background material, and then formalise our approximate inference for the Variational RNN, resulting in the dropout variant proposed above. Experimental results are presented thereafter.
2 Related Research
In the past few years a considerable body of work has been collected demonstrating the negative effects of a naive application of dropout in RNNs’ recurrent connections. Pachitariu and Sahani [7], working with language models, reason that noise added in the recurrent connections of an RNN leads to model instabilities. Instead, they add noise to the decoding part of the model alone. Bayer et al. [8] apply a deterministic approximation of dropout (fast dropout) in RNNs. They reason that with dropout, the RNN’s dynamics change dramatically, and that dropout should be applied to the “non-dynamic” parts of the model – connections feeding from the hidden layer to the output layer. Pham et al. [9] assess dropout with handwriting recognition tasks. They conclude that dropout in recurrent layers disrupts the RNN’s ability to model sequences, and that dropout should be applied to feed-forward connections and not to recurrent connections. The work by Zaremba, Sutskever, and Vinyals [4] was developed in parallel to Pham et al. [9]. Zaremba et al. [4] assess the performance of dropout in RNNs on a wide series of tasks. They show that applying dropout to the non-recurrent connections alone results in improved performance, and provide (as yet unbeaten) state-of-the-art results in language modelling on the Penn Treebank. They reason that without dropout only small models were used in the past in order to avoid overfitting, whereas with the application of dropout larger models can be used, leading to improved results. This work is considered a reference implementation by many (and we compare to this as a baseline below). Bluche et al. [10] extend on the previous body of work and perform exploratory analysis of the performance of dropout before, inside, and after the RNN’s unit. They provide mixed results, not showing significant improvement on existing techniques. More recently, and done in parallel to this work, Moon et al. [20] suggested a new variant of dropout in RNNs in the speech recognition community. They randomly drop elements in the LSTM’s internal cell ct and use the same mask at every time step. This is the closest to our proposed approach (although fundamentally different to the approach we suggest, explained in §4.1), and we compare to this variant below as well.
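The variational variant can be sketched by sampling the dropout masks once per sequence and reusing them at every step; the tanh cell, the shapes, and the inverted-dropout scaling are illustrative assumptions, not from the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_masks(input_size, hidden_size, p=0.5):
    # One mask per sequence, reused at every time step ("variational" dropout)
    keep = 1.0 - p
    mask_x = (rng.random(input_size) < keep) / keep   # inverted-dropout scaling
    mask_h = (rng.random(hidden_size) < keep) / keep
    return mask_x, mask_h

def run_rnn(xs, h0, Wxh, Whh, b, mask_x, mask_h):
    h = h0
    for x_t in xs:  # the SAME masks drop the same units at each time step
        h = np.tanh(Wxh @ (x_t * mask_x) + Whh @ (h * mask_h) + b)
    return h
```

Naive dropout would instead resample `mask_x` (and apply nothing on `Whh @ h`) inside the loop; here the recurrent connection is dropped consistently across time.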
![Page 88: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/88.jpg)
Regularization: Norm-stabilizer
• Stabilize the activations of RNNs by penalizing the squared distance between successive hidden states’ norms
• Enforce the norms of the hidden-layer activations to remain approximately constant across time
90Regularizing RNNs by Stabilizing Activations. Krueger and Memisevic, ICLR 2016
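The penalty follows directly from the description above; the function name and the mean reduction over time steps are illustrative (the paper adds a term of this form, weighted by a hyperparameter, to the task loss):

```python
import numpy as np

def norm_stabilizer(hidden_states, beta=1.0):
    """hidden_states: (T, H) array of hidden states over T time steps."""
    norms = np.linalg.norm(hidden_states, axis=1)
    # Penalize the squared distance between successive hidden-state norms
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)
```

The penalty is zero exactly when every hidden state has the same norm, which is the "approximately constant across time" behaviour being encouraged.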
![Page 89: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/89.jpg)
Regularization: Layer Normalization
• Similar to batch normalization
• Computes the normalization statistics separately at each time step
• Effective for stabilizing the hidden state dynamics in RNNs
• Reduces training time
91 Layer Normalization [Ba, Kiros & Hinton, 2016]
Figure 2: Validation curves for the attentive reader model. BN results are taken from [Cooijmans et al., 2016].
We trained two models: the baseline order-embedding model as well as the same model with layer normalization applied to the GRU. After every 300 iterations, we compute Recall@K (R@K) values on a held-out validation set and save the model whenever R@K improves. The best performing models are then evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported. Both models use Adam [Kingma and Ba, 2014] with the same initial hyperparameters and both models are trained using the same architectural choices as used in Vendrov et al. [2016]. We refer the reader to the appendix for a description of how layer normalization is applied to GRU.
Figure 1 illustrates the validation curves of the models, with and without layer normalization. We plot R@1, R@5 and R@10 for the image retrieval task. We observe that layer normalization offers a per-iteration speedup across all metrics and converges to its best validation model in 60% of the time it takes the baseline model to do so. In Table 2, the test set results are reported, from which we observe that layer normalization also results in improved generalization over the original model. The results we report are state-of-the-art for RNN embedding models, with only the structure-preserving model of Wang et al. [2016] reporting better results on this task. However, they evaluate under different conditions (1 test set instead of the mean over 5) and are thus not directly comparable.
6.2 Teaching machines to read and comprehend
In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al., 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task where a query description about a passage must be answered by filling in a blank. The data is anonymized such that entities are given randomized tokens to prevent degenerate solutions, which are consistently permuted during training and evaluation. We follow the same experimental protocol as Cooijmans et al. [2016] and modify their public code to incorporate layer normalization², which uses Theano [Team et al., 2016]. We obtained the pre-processed dataset used by Cooijmans et al. [2016], which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. In Cooijmans et al. [2016], two variants of recurrent batch normalization are used: one where BN is only applied to the LSTM while the other applies BN everywhere throughout the model. In our experiment, we only apply layer normalization within the LSTM.
The results of this experiment are shown in Figure 2. We observe that layer normalization not only trains faster but converges to a better validation result over both the baseline and BN variants. In Cooijmans et al. [2016], it is argued that the scale parameter in BN must be carefully chosen and is set to 0.1 in their experiments. We experimented with layer normalization for both 1.0 and 0.1 scale initialization and found that the former model performed significantly better. This demonstrates that layer normalization is not sensitive to the initial scale in the same way that recurrent BN is.³
6.3 Skip-thought vectors
Skip-thoughts [Kiros et al., 2015] is a generalization of the skip-gram model [Mikolov et al., 2013] for learning unsupervised distributed sentence representations. Given contiguous text, a sentence is…
²https://github.com/cooijmanstim/Attentive_reader/tree/bn
³We only produce results on the validation set, as in the case of Cooijmans et al. [2016]
3.1 Layer normalized recurrent neural networks
The recent sequence to sequence models [Sutskever et al., 2014] utilize compact recurrent neural networks to solve sequential prediction problems in natural language processing. It is common among the NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.
In a standard RNN, the summed inputs in the recurrent layer are computed from the current input $x^t$ and the previous vector of hidden states $h^{t-1}$, as $a^t = W_{hh} h^{t-1} + W_{xh} x^t$. The layer normalized recurrent layer re-centers and re-scales its activations using the extra normalization terms similar to Eq. (3):

$$h^t = f\left[\frac{g}{\sigma^t} \odot \left(a^t - \mu^t\right) + b\right], \qquad \mu^t = \frac{1}{H}\sum_{i=1}^{H} a_i^t, \qquad \sigma^t = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^t - \mu^t\right)^2} \qquad (4)$$

where $W_{hh}$ is the recurrent hidden-to-hidden weights and $W_{xh}$ are the bottom-up input-to-hidden weights. $\odot$ is the element-wise multiplication between two vectors. $b$ and $g$ are defined as the bias and gain parameters of the same dimension as $h^t$.
In a standard RNN, there is a tendency for the average magnitude of the summed inputs to the recurrent units to either grow or shrink at every time-step, leading to exploding or vanishing gradients. In a layer normalized RNN, the normalization terms make it invariant to re-scaling all of the summed inputs to a layer, which results in much more stable hidden-to-hidden dynamics.
4 Related work
Batch normalization has been previously extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. The previous work [Cooijmans et al., 2016] suggests the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron. Applying either weight normalization or batch normalization using expected statistics is equivalent to having a different parameterization of the original feed-forward neural network. Re-parameterization in the ReLU network was studied in the Path-normalized SGD [Neyshabur et al., 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model, thus, has different invariance properties than the other methods, which we will study in the following section.
5 Analysis
In this section, we investigate the invariance properties of different normalization schemes.
5.1 Invariance under weights and data transformations
The proposed layer normalization is related to batch normalization and weight normalization. Although their normalization scalars are computed differently, these methods can be summarized as normalizing the summed inputs $a_i$ to a neuron through the two scalars $\mu$ and $\sigma$. They also learn an adaptive bias $b$ and gain $g$ for each neuron after the normalization:

$$h_i = f\left(\frac{g_i}{\sigma_i}\left(a_i - \mu_i\right) + b_i\right) \qquad (5)$$

Note that for layer normalization and batch normalization, $\mu$ and $\sigma$ are computed according to Eq. 2 and 3. In weight normalization, $\mu$ is 0, and $\sigma = \lVert w \rVert_2$.
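Eq. (4) maps almost line-for-line onto code; choosing f = tanh and adding a small eps for numerical stability are illustrative choices, not from the paper:

```python
import numpy as np

def layer_norm_rnn_step(x_t, h_prev, Wxh, Whh, g, b, eps=1e-5):
    a_t = Whh @ h_prev + Wxh @ x_t            # summed inputs at time t
    mu = a_t.mean()                           # statistics over the H hidden units
    sigma = np.sqrt(((a_t - mu) ** 2).mean() + eps)
    # Re-center and re-scale, then apply the nonlinearity (Eq. (4) with f = tanh)
    return np.tanh((g / sigma) * (a_t - mu) + b)
```

Because mu and sigma are computed per time step over the hidden units alone, nothing depends on the batch or on sequence length, which is exactly why layer normalization avoids batch normalization's per-time-step statistics problem.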
![Page 90: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/90.jpg)
Scheduled Sampling
• “change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead.”
• During training, randomly replace a conditioning ground-truth token by the model's previous prediction
92 Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Bengio et al., NIPS 2015
![Page 91: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/91.jpg)
![Page 92: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/92.jpg)
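The replace-with-own-prediction step can be sketched as below; the linear decay schedule is one of the options proposed in the paper (alongside exponential and inverse-sigmoid decay), and all names here are illustrative:

```python
import random

random.seed(0)

def next_input(true_token, predicted_token, epsilon):
    # With probability epsilon feed the ground truth (fully guided scheme),
    # otherwise feed the model's own previous prediction (less guided scheme)
    return true_token if random.random() < epsilon else predicted_token

def linear_decay(step, k=10_000):
    # epsilon starts at 1 (pure teacher forcing) and decays toward 0
    return max(0.0, 1.0 - step / k)
```

Early in training the model conditions mostly on gold tokens; as epsilon decays, training conditions increasingly match the test-time setting, where only the model's own predictions are available.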
![Page 93: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/93.jpg)
Gated Cells
• Rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through
95
RNN vs. LSTM, GRU, etc.: long short-term memory cells are able to keep track of information throughout many time steps.
![Page 94: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/94.jpg)
Long Short-Term Memory (LSTM)
96
Long Short-Term Memory [Hochreiter et al., 1997]
(Diagram: a chain of LSTM cells passing cell states c_j, c_{j+1} between time steps)
![Page 99: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/99.jpg)
The LSTM Idea
101
c_t = c_{t−1} + tanh( W [x_t; h_{t−1}] )
h_t = tanh(c_t)
* Dashed line indicates time-lag
(Diagram: a memory cell taking x_t and h_{t−1} through weights W_xc, W_hc, producing c_t and h_t)
![Page 100: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/100.jpg)
The Original LSTM Cell
102
c_t = c_{t−1} + i_t ⊗ tanh( W [x_t; h_{t−1}] )
h_t = o_t ⊗ tanh(c_t)
i_t = σ( W_i [x_t; h_{t−1}] + b_i )   (similarly for o_t)
(Diagram: the cell plus an input gate i_t and an output gate o_t, with weights W_xc, W_hc, W_xi, W_hi, W_xo, W_ho)
![Page 101: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/101.jpg)
The Popular LSTM Cell
103
c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t; h_{t−1}] )
h_t = o_t ⊗ tanh(c_t)
f_t = σ( W_f [x_t; h_{t−1}] + b_f )
i_t = σ( W_i [x_t; h_{t−1}] + b_i )
(Diagram: forget gate f_t, input gate i_t and output gate o_t around the cell, with weights W_xc, W_hc, W_xf, W_hf, W_xi, W_hi, W_xo, W_ho)
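The gate equations above can be sketched as one NumPy step (a minimal illustration, not a library implementation; each weight matrix acts on the concatenation [x_t; h_{t−1}]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One step of the popular LSTM cell."""
    xh = np.concatenate([x_t, h_prev])   # [x_t; h_{t-1}]
    f_t = sigmoid(Wf @ xh + bf)          # forget gate
    i_t = sigmoid(Wi @ xh + bi)          # input gate
    o_t = sigmoid(Wo @ xh + bo)          # output gate
    c_tilde = np.tanh(Wc @ xh + bc)      # candidate cell update
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t

# tiny usage example: input size 2, hidden size 3, random small weights
rng = np.random.default_rng(0)
nx, nh = 2, 3
W = [rng.standard_normal((nh, nx + nh)) * 0.1 for _ in range(4)]
b = [np.zeros(nh) for _ in range(4)]
h, c = np.zeros(nh), np.zeros(nh)
h, c = lstm_step(rng.standard_normal(nx), h, c, *W, *b)
```

Note the key property: c_t is updated additively (gated sum), not by repeated matrix multiplication, which is what lets gradients flow across many time steps.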
![Page 102: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/102.jpg)
The Popular LSTM Cell
104
• The forget gate decides what information is going to be thrown away from the cell state:
f_t = σ( W_f [x_t; h_{t−1}] + b_f )
![Page 103: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/103.jpg)
The Popular LSTM Cell
105
• The input gate and a tanh layer decide what information is going to be stored in the cell state:
i_t = σ( W_i [x_t; h_{t−1}] + b_i )
![Page 104: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/104.jpg)
The Popular LSTM Cell
106
• Update the old cell state with the new one:
c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t; h_{t−1}] )
![Page 105: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/105.jpg)
The Popular LSTM Cell
107
The input and forget gates together determine how the cell state is updated:

input gate | forget gate | behavior
0          | 1           | remember the previous value
1          | 1           | add to the previous value
0          | 0           | erase the value
1          | 0           | overwrite the value
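The four behaviors in the table can be checked numerically against the cell update c_t = f_t ⊗ c_{t−1} + i_t ⊗ c̃_t; a small sketch with hard 0/1 gates and arbitrary illustrative values (previous cell 2.0, candidate 0.5):

```python
def cell_update(i_gate, f_gate, c_prev, c_candidate):
    # c_t = f_t * c_{t-1} + i_t * c~_t  (scalar version of the gated sum)
    return f_gate * c_prev + i_gate * c_candidate

c_prev, cand = 2.0, 0.5
remember  = cell_update(0, 1, c_prev, cand)  # keeps the previous value
add       = cell_update(1, 1, c_prev, cand)  # previous value + candidate
erase     = cell_update(0, 0, c_prev, cand)  # zeroes the cell
overwrite = cell_update(1, 0, c_prev, cand)  # replaces with the candidate
```

In a trained LSTM the gates are sigmoid outputs in (0, 1), so the real behavior is a soft blend of these four extremes.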
![Page 106: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/106.jpg)
The Popular LSTM Cell
108
• The output gate decides what is going to be output. The final output is based on the cell state and the output of the sigmoid gate:
o_t = σ( W_o [x_t; h_{t−1}] + b_o )
h_t = o_t ⊗ tanh(c_t)
![Page 107: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/107.jpg)
LSTM – Forward/Backward
109
Illustrated LSTM Forward and Backward Pass: http://arunmallya.github.io/writeups/nn/lstm/index.html
![Page 108: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/108.jpg)
LSTM variants
110
![Page 109: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/109.jpg)
The Popular LSTM Cell
111
c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t; h_{t−1}] )
h_t = o_t ⊗ tanh(c_t)
f_t = σ( W_f [x_t; h_{t−1}] + b_f )   (similarly for i_t, o_t)
* Dashed line indicates time-lag
![Page 110: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/110.jpg)
Extension I: Peephole LSTM
112
• Add peephole connections.
• All gate layers look at the cell state!
f_t = σ( W_f [x_t; h_{t−1}; c_{t−1}] + b_f )   (similarly for i_t; o_t uses c_t)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t; h_{t−1}] )
h_t = o_t ⊗ tanh(c_t)
* Dashed line indicates time-lag
![Page 111: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/111.jpg)
Other minor variants
• Coupled Input and Forget Gate: f_t = 1 − i_t
• Full Gate Recurrence: f_t = σ( W_f [x_t; h_{t−1}; c_{t−1}; i_{t−1}; f_{t−1}; o_{t−1}] + b_f )
114
![Page 112: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/112.jpg)
LSTM: A Search Space Odyssey
• Tested the following variants, using the Peephole LSTM as standard:
1. No Input Gate (NIG)
2. No Forget Gate (NFG)
3. No Output Gate (NOG)
4. No Input Activation Function (NIAF)
5. No Output Activation Function (NOAF)
6. No Peepholes (NP)
7. Coupled Input and Forget Gate (CIFG)
8. Full Gate Recurrence (FGR)
• On the tasks of:
− TIMIT Speech Recognition: audio frame to 1 of 61 phonemes
− IAM Online Handwriting Recognition: sketch to characters
− JSB Chorales: next-step music frame prediction
115
LSTM: A Search Space Odyssey [Greff et al., 2015]
![Page 113: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/113.jpg)
LSTM: A Search Space Odyssey• The standard LSTM performed reasonably well on multiple datasets
and none of the modifications significantly improved the performance
• Coupling gates and removing peephole connections simplified the LSTM without hurting performance much
• The forget gate and output activation are crucial
• Found the interaction between learning rate and network size to be minimal, which indicates that hyperparameters can be calibrated using a small network first
116LSTM: A Search Space Odyssey [Greff et al., 2015]
![Page 114: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/114.jpg)
Gated Recurrent Unit
117
![Page 115: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/115.jpg)
Gated Recurrent Unit (GRU)
• A very simplified version of the LSTM
− Merges the forget and input gates into a single ‘update’ gate
− Merges the cell and hidden state
• Has fewer parameters than an LSTM and has been shown to outperform the LSTM on some tasks
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation [Cho et al., 2014]
118
![Page 116: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/116.jpg)
GRU
119
r_t = σ( W_r [x_t; h_{t−1}] + b_r )
z_t = σ( W_z [x_t; h_{t−1}] + b_z )
h′_t = tanh( W [x_t; r_t ⊗ h_{t−1}] )
h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h′_t
(Diagram: reset gate r_t and update gate z_t with weights W_r, W_z, W producing h′_t and h_t)
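The GRU equations above can likewise be sketched as one NumPy step (a minimal illustration with assumed weight shapes, each matrix acting on a concatenated input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, W, br, bz, b):
    """One GRU step; note there is no separate cell state."""
    xh = np.concatenate([x_t, h_prev])           # [x_t; h_{t-1}]
    r_t = sigmoid(Wr @ xh + br)                  # reset gate
    z_t = sigmoid(Wz @ xh + bz)                  # update gate
    h_cand = np.tanh(W @ np.concatenate([x_t, r_t * h_prev]) + b)
    return (1.0 - z_t) * h_prev + z_t * h_cand   # interpolate old and new

# tiny usage example: input size 2, hidden size 3, random small weights
rng = np.random.default_rng(1)
nx, nh = 2, 3
Wr, Wz, W = (rng.standard_normal((nh, nx + nh)) * 0.1 for _ in range(3))
br, bz, b = np.zeros(nh), np.zeros(nh), np.zeros(nh)
h = gru_step(rng.standard_normal(nx), np.zeros(nh), Wr, Wz, W, br, bz, b)
```

Compared with the LSTM step, there are three weight blocks instead of four, and the hidden state itself plays the role of the memory cell.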
![Page 117: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/117.jpg)
GRU
120
• Computes a reset gate based on the current input and hidden state:
r_t = σ( W_r [x_t; h_{t−1}] + b_r )
![Page 118: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/118.jpg)
GRU
121
• Computes a candidate hidden state based on the current input and the (reset) hidden state:
h′_t = tanh( W [x_t; r_t ⊗ h_{t−1}] )
• If the reset gate unit is ~0, this ignores the previous memory and only stores the new input information
![Page 119: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/119.jpg)
GRU
122
• Computes an update gate, again based on the current input and hidden state:
z_t = σ( W_z [x_t; h_{t−1}] + b_z )
![Page 120: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/120.jpg)
GRU
123
• The final memory at time step t combines both the current and previous time steps:
h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h′_t
![Page 121: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/121.jpg)
GRU Intuition
• If the reset gate is close to 0, ignore the previous hidden state
− Allows the model to drop information that is irrelevant in the future
• The update gate z controls how much of the past state should matter now
− With h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h′_t, a unit whose z stays close to 0 copies its information through many time steps. Less vanishing gradient!
• Units with short-term dependencies often have very active reset gates
124
Slide credit: Richard Socher
![Page 122: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/122.jpg)
LSTMs and GRUs
Good
• Careful initialization and optimization of vanilla RNNs can enable them to learn long(ish) dependencies, but gated additive cells, like the LSTM and GRU, often just work.
Bad
• LSTMs and GRUs have considerably more parameters and computation per memory cell than a vanilla RNN; as such, they have less memory capacity per parameter.*
125
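The "more parameters per memory cell" point can be made concrete with a quick count, under the simplifying assumption that every gate or candidate is one weight block of shape n_h × (n_x + n_h) plus a bias: a vanilla RNN has one such block, a GRU three, an LSTM four.

```python
def rnn_params(nx, nh):
    # one block: W of shape (nh, nx + nh) plus a bias of length nh
    return nh * (nx + nh) + nh

def lstm_params(nx, nh):
    # four blocks: input, forget, output gates + cell candidate
    return 4 * rnn_params(nx, nh)

def gru_params(nx, nh):
    # three blocks: reset, update gates + hidden candidate
    return 3 * rnn_params(nx, nh)

# example figures: 256-dim inputs, 512 hidden units
nx, nh = 256, 512
# at equal hidden size, the LSTM has 4x and the GRU 3x the
# parameters of a vanilla RNN, hence less memory per parameter
```

(Real implementations may add peepholes or separate input/recurrent biases, which changes the exact count but not the 4x / 3x ratio of blocks.)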
![Page 123: DEEP UNSUPERVISED LEARNING](https://reader031.fdocuments.us/reader031/viewer/2022040613/624b99e5cfa6e15aef52c988/html5/thumbnails/123.jpg)
Next Lecture: Attention and Transformers
126