Content
1. Example of Vanilla RNN
2. RNN Forward pass
3. RNN Backward pass
4. LSTM design
RNN Training problem
Feed-forward ("vanilla") network
[Figure: feed-forward network with input X and output y; node values shown: 1, 0, 0, 1, 0]
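Vanilla recurrent network
[Figure: RNN cell with input X, hidden state h, and output y; weights W_{hx} (input to hidden), W_{hh} (hidden to hidden), W_{hy} (hidden to output)]
(1) h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x + b_h)
(2) y = W_{hy} h_t + b_y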
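Equations (1) and (2) translate almost line for line into code. The following is a minimal NumPy sketch, not the author's implementation: the function name `rnn_step`, the layer sizes, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def rnn_step(x, h_prev, W_hh, W_hx, W_hy, b_h, b_y):
    """One time step of a vanilla RNN; returns the new hidden state and the output."""
    h = np.tanh(W_hh @ h_prev + W_hx @ x + b_h)   # equation (1)
    y = W_hy @ h + b_y                            # equation (2)
    return h, y

# Hypothetical sizes and random weights, for illustration only.
n_in, n_hidden, n_out = 4, 3, 4
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hx = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

x = np.array([0.0, 1.0, 0.0, 0.0])   # a one-hot input vector
h0 = np.zeros(n_hidden)              # initial hidden state
h1, y1 = rnn_step(x, h0, W_hh, W_hx, W_hy, b_h, b_y)
print(h1.shape, y1.shape)
```

To process a sequence, call `rnn_step` once per input and pass the returned hidden state into the next call.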
Example: character-level language processing
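[Figure: RNN with input X, hidden state h, output Y, and probabilities P; weights W_{hx}, W_{hh}, W_{hy}]
Training sequence: "hello"
Vocabulary: [e, h, l, o]
One-hot inputs:
"h" = 0100
"e" = 1000
"l" = 0010
"o" = 0001
Parameters:
W_{hx} = [3.6, -4.8, 0.35, -0.26]
W_{hy} = [-12., -0.67, -0.85, 14.]
b_y = [-0.2, -2.9, 6.1, -3.4]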
Step 1: input "h" (X = 0100), initial hidden state h_0 = 0
h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x + b_h) gives h = -0.99
y = W_{hy} h_t + b_y = [11., -2.2, 6.9, -17]
p = [0.99, 0, 0.01, 0]
Predicted next character: "e" (1000)
Step 2: input "e" (X = 1000), previous hidden state h = -0.99
h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x + b_h) gives h = -0.09
y = W_{hy} h_t + b_y = [0.86, -2.8, 6.2, -4.6]
p = [0, 0, 0.99, 0]
Predicted next character: "l" (0010)
Step 3: input "l" (X = 0010), previous hidden state h = -0.09
New hidden state: h = 0.38
y = [-4.7, -3.2, 5.8, 1.9]
p = [0, 0, 0.98, 0.02]
Predicted next character: "l" (0010)
Step 4: input "l" (X = 0010), previous hidden state h = 0.38
New hidden state: h = 0.98
y = [-12., -3.6, 5.3, 10.]
p = [0, 0, 0.01, 0.99]
Predicted next character: "o" (0001)
Sequence so far: "h" "e" "l" "l" "o"
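Summary (input, previous hidden state -> predicted next character):
"h", h_0 = 0 -> "e"
"e", -0.99 -> "l"
"l", -0.09 -> "l"
"l", 0.38 -> "o"
The same input "l" produces two different predictions because the previous hidden state differs.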
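The output side of the "hello" walkthrough can be checked numerically. The sketch below takes W_{hy} and b_y from the example, plugs in the four hidden-state values quoted for each step (-0.99, -0.09, 0.38, 0.98), and applies a softmax. The recurrent weight W_{hh} and bias b_h are not given in the slides, so the hidden states are used as quoted rather than recomputed. The helper name `softmax` is an assumption for illustration.

```python
import numpy as np

vocab = ["e", "h", "l", "o"]
W_hy = np.array([-12.0, -0.67, -0.85, 14.0])   # hidden-to-output weights from the example
b_y = np.array([-0.2, -2.9, 6.1, -3.4])        # output bias from the example

def softmax(y):
    """Convert output scores into probabilities (shifted by the max for stability)."""
    z = np.exp(y - y.max())
    return z / z.sum()

# Hidden state after reading "h", "e", "l", "l", as quoted in the walkthrough.
hidden_states = [-0.99, -0.09, 0.38, 0.98]

predictions = []
for h in hidden_states:
    y = W_hy * h + b_y          # y = W_hy * h_t + b_y with a scalar hidden state
    p = softmax(y)
    predictions.append(vocab[int(np.argmax(p))])
    print(np.round(y, 2), np.round(p, 2))

print(predictions)
```

The most probable character at each step spells out the rest of the training sequence.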
Longer training sequences (input X -> output P):
"hello" -> "hello"
"hello ben" -> "hello ben"
"hello world" -> "hello world"

Source text: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..." (A Tale of Two Cities, Charles Dickens)

Sequence -> output, training iterations:
"it was" -> "it was", 50,000
"it was the" -> "it was the", 300,000 (loss = 1.6066)
"it was the best" -> "it was the best", 1,000,000 (loss = 1.8197)
"it was the best of" -> "it wes the best of", 2,000,000 (loss = 4.0844)

Training log:
...
epoch 500000, loss: 6.447782290456328
...
epoch 1000000, loss: 5.290576956983398
...
epoch 1800000, loss: 4.267105168323299
epoch 1900000, loss: 4.175163586546514
epoch 2000000, loss: 4.0844739848413285
Vanilla recurrent network
(1) h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x + b_h)
(2) y = W_{hy} h_t + b_y
Input:  i t _ w a s _ t
Target: t _ w a s _ t h
(_ marks a space; each target is the character that follows the corresponding input.)
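The input and target rows are the same text shifted by one character. A short sketch of how such training pairs could be built follows; the variable names and the `one_hot` helper are assumptions for illustration, not code from the slides.

```python
text = "it was th"

inputs = list(text[:-1])    # every character except the last
targets = list(text[1:])    # the same text shifted left by one character

# Build a vocabulary and a one-hot encoder for the input characters.
vocab = sorted(set(text))
char_to_index = {ch: k for k, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a list with a single 1 at the index of character ch."""
    vec = [0] * len(vocab)
    vec[char_to_index[ch]] = 1
    return vec

# Each training pair is (one-hot input vector, index of the target character).
pairs = [(one_hot(a), char_to_index[b]) for a, b in zip(inputs, targets)]

for a, b in zip(inputs, targets):
    print(repr(a), "->", repr(b))
```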
RNNs for Different Problems
Vanilla neural network
Image captioning: image -> sequence of words
Sentiment analysis: sequence of words -> class
Translation: sequence of words -> sequence of words
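[Figure: RNN unrolled over three time steps with hidden states h_0, h_1, h_2; inputs x_0 = 1, x_1 = 1, x_2 = 2; target 3; W_{hh} = 0.024, W_{hy} = 0.051]
L = f(W_{hx}, W_{hh}, W_{hy})
Gradient descent updates with learning rate 0.01:
w_{hx} \leftarrow w_{hx} - 0.01 \cdot \frac{\partial L}{\partial w_{hx}}
w_{hh} \leftarrow w_{hh} - 0.01 \cdot \frac{\partial L}{\partial w_{hh}}
w_{hy} \leftarrow w_{hy} - 0.01 \cdot \frac{\partial L}{\partial w_{hy}}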
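Training is hard with vanilla RNNs
\nabla L = \left[ \frac{\partial L}{\partial w_{hx}}, \frac{\partial L}{\partial w_{hh}}, \frac{\partial L}{\partial w_{hy}} \right]
[Figure: unrolled RNN (h_0, h_1, h_2; inputs 1, 1, 2; target 3) with arrows marking the forward pass and the backward pass through W_{hx}, W_{hh}, W_{hy}]
\frac{\partial L}{\partial w_{hh}} = ?
L = ?
L = \left( W_{hy} \tanh\left( W_{hh} \tanh\left( W_{hh} \tanh(W_{hx} x_0) + W_{hx} x_1 \right) + W_{hx} x_2 \right) - 3 \right)^2
Compute gradient
Recursive application of chain rule:
L = f(g(h(k(l(m(n(w)))))))
\frac{\partial L}{\partial w} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial k} \cdot \frac{\partial k}{\partial l} \cdot \frac{\partial l}{\partial m} \cdot \frac{\partial m}{\partial n} \cdot \frac{\partial n}{\partial w}
f = f(g), g = g(h), h = h(k), and so on down to n = n(w)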
Gradient by hand
[Figure: RNN unrolled over three time steps: inputs x_0 = 1, x_1 = 1, x_2 = 2; hidden states h_0, h_1, h_2; target 3]
Forward Pass
Weights: W_{hx} = 0.078, W_{hh} = 0.024, W_{hy} = 0.051
W_{hx} x_0 = 0.078 \cdot 1 = 0.078
h_0 = \tanh(0.078) = 0.0778
W_{hh} h_0 = 0.024 \cdot 0.0778 = 0.00187
W_{hx} x_1 = 0.078 \cdot 1 = 0.078
W_{hh} h_0 + W_{hx} x_1 = 0.07987
h_1 = \tanh(0.07987) = 0.07970
W_{hh} h_1 = 0.024 \cdot 0.07970 = 0.0019
W_{hx} x_2 = 0.078 \cdot 2 = 0.156
W_{hh} h_1 + W_{hx} x_2 = 0.1579
h_2 = \tanh(0.1579) = 0.1566
y = W_{hy} h_2 = 0.051 \cdot 0.1566 = 0.0080
y - 3 = -2.99
L = (y - 3)^2 = 8.95
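The forward pass of the by-hand example is small enough to reproduce directly. This sketch uses the weights, inputs, and target from the computation graph (W_{hx} = 0.078, W_{hh} = 0.024, W_{hy} = 0.051, inputs 1, 1, 2, target 3). The function name `forward` and the zero initial hidden state are assumptions; with a zero initial state the first step reduces to tanh(W_{hx} x_0), as in the graph.

```python
import math

w_hx, w_hh, w_hy = 0.078, 0.024, 0.051
xs = [1.0, 1.0, 2.0]
target = 3.0

def forward(w_hx, w_hh, w_hy, xs, target):
    """Run the scalar RNN over xs and return (hidden states, output, squared-error loss)."""
    h = 0.0                      # assumed zero initial state, so step 0 is tanh(w_hx * x0)
    hidden_states = []
    for x in xs:
        h = math.tanh(w_hh * h + w_hx * x)
        hidden_states.append(h)
    y = w_hy * h                 # output from the last hidden state (no bias in this example)
    loss = (y - target) ** 2
    return hidden_states, y, loss

hs, y, loss = forward(w_hx, w_hh, w_hy, xs, target)
print([round(h, 4) for h in hs], round(y, 4), round(loss, 2))
```

The printed hidden states, output, and loss can be compared against the values in the forward-pass graph.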
Backward Pass
Notation for the intermediate values of the forward pass: a_0 = W_{hx} x_0, a_1 = W_{hh} h_0 + W_{hx} x_1, a_2 = W_{hh} h_1 + W_{hx} x_2, h_t = \tanh(a_t), y = W_{hy} h_2, s = y - 3, L = s^2.
Chain rule along the longest path, from L back to the first input:
\frac{\partial L}{\partial w_{hx}} = \frac{\partial L}{\partial s} \cdot \frac{\partial s}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial h_0} \cdot \frac{\partial h_0}{\partial a_0} \cdot \frac{\partial a_0}{\partial w_{hx}}
Step by step:
\frac{\partial L}{\partial L} = 1
\frac{\partial s^2}{\partial s} = 2s = 2(-2.99) = -5.98
\frac{\partial s}{\partial y} = 1, so \frac{\partial L}{\partial y} = -5.98
\frac{\partial y}{\partial h_2} = W_{hy} = 0.051, so \frac{\partial L}{\partial h_2} = -0.304
\frac{\partial y}{\partial W_{hy}} = h_2 = 0.1566, so \frac{\partial L}{\partial W_{hy}} = -0.936
\frac{\partial h_2}{\partial a_2} = 1 - h_2^2 = 1 - 0.1566^2 = 0.975, so \frac{\partial L}{\partial a_2} = -0.297
\frac{\partial a_2}{\partial h_1} = W_{hh} = 0.024, so \frac{\partial L}{\partial h_1} = -0.0071
\frac{\partial h_1}{\partial a_1} = 1 - h_1^2 = 1 - 0.0797^2 = 0.993, so \frac{\partial L}{\partial a_1} = -0.0071
\frac{\partial a_1}{\partial W_{hh}} = h_0 = 0.0778, which contributes -0.0005 to \frac{\partial L}{\partial w_{hh}}
\frac{\partial a_1}{\partial h_0} = W_{hh} = 0.024, so \frac{\partial L}{\partial h_0} = -0.00017
\frac{\partial h_0}{\partial a_0} = 1 - h_0^2 = 1 - 0.0778^2 = 0.993, so \frac{\partial L}{\partial a_0} = -0.00017
\frac{\partial a_0}{\partial w_{hx}} = x_0 = 1, which contributes -0.00017 to \frac{\partial L}{\partial w_{hx}}
Update with learning rate 0.01:
w_i \leftarrow w_i - 0.01 \cdot \frac{\partial L}{\partial w_i}
w_{hx} \leftarrow 0.078 - 0.01 \cdot (-0.00017) = 0.0780017
w_{hh} \leftarrow 0.024 - 0.01 \cdot (-0.0005) = 0.024005
These values follow only the longest path, back through h_1 and h_0 to the first input. The full gradient also adds the shorter paths (for example, W_{hx} enters a_2 directly through x_2), but the long-path terms are the ones that shrink at every step back in time.
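The backward pass can be written out as plain arithmetic. The sketch below repeats the forward pass, then applies the chain rule one node at a time. It reports two things: the contributions that travel along the longest path (the quantities traced on the slides), and the full gradients obtained by summing over all time steps. The variable names (`d_a2`, `path_w_hx`, and so on) are assumptions for illustration, where `d_a_t` stands for the gradient of the loss with respect to the input of the t-th tanh.

```python
import math

w_hx, w_hh, w_hy = 0.078, 0.024, 0.051
x0, x1, x2 = 1.0, 1.0, 2.0
target = 3.0

# Forward pass, keeping every intermediate value.
h0 = math.tanh(w_hx * x0)
h1 = math.tanh(w_hh * h0 + w_hx * x1)
h2 = math.tanh(w_hh * h1 + w_hx * x2)
y = w_hy * h2
loss = (y - target) ** 2

# Backward pass: one chain-rule factor per node.
d_y = 2 * (y - target)          # derivative of the squared error, about -5.98
d_w_hy = d_y * h2               # gradient for the output weight
d_h2 = d_y * w_hy
d_a2 = d_h2 * (1 - h2 ** 2)     # tanh'(a) = 1 - tanh(a)^2
d_h1 = d_a2 * w_hh
d_a1 = d_h1 * (1 - h1 ** 2)
d_h0 = d_a1 * w_hh
d_a0 = d_h0 * (1 - h0 ** 2)

# Contributions along the longest path only.
path_w_hx = d_a0 * x0           # reaches w_hx through the first input
path_w_hh = d_a1 * h0           # reaches w_hh through h0

# Full gradients: sum the contributions from every time step.
d_w_hx = d_a2 * x2 + d_a1 * x1 + d_a0 * x0
d_w_hh = d_a2 * h1 + d_a1 * h0

print("longest path:", path_w_hx, path_w_hh)
print("full gradient:", d_w_hx, d_w_hh, d_w_hy)
```

The longest-path values agree with the walkthrough up to rounding in the last digit, and they are orders of magnitude smaller than the contributions from the most recent time step, which dominate the full gradient.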
\frac{\partial L}{\partial x} = w_{hh} \cdots w_{hh} \cdots w_{hh} \cdots w_{hh} = w_{hh}^n \cdot C(w)
[Figure: chain of tanh units connected through w_{hh}, with W_{hh} = 0.024]
Powers of w_{hh} = 0.024:
1. 0.024
2. 0.000576
3. 1.382e-05
4. 3.318e-07
5. 7.963e-09
6. 1.911e-10
7. 4.586e-12
8. 1.101e-13
9. 2.642e-15
10. 6.340e-17
Source: https://imgur.com/gallery/vaNahKE
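The table of powers is easy to regenerate. This short sketch raises the recurrent weight 0.024 to the powers 1 through 10, which is the factor the gradient picks up from w_{hh} alone after that many steps back in time. The tanh derivatives, which are at most 1, would only shrink it further. The list name `factors` is an assumption for illustration.

```python
w_hh = 0.024

# Factor contributed by w_hh after n steps back in time, for n = 1..10.
factors = [w_hh ** n for n in range(1, 11)]

for n, factor in enumerate(factors, start=1):
    print(f"{n:2d}. {factor:.3e}")
```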
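Long Short-Term Memory (LSTM)
[Figure: the n-dimensional input x and the n-dimensional previous hidden state h are stacked into a 2n vector and multiplied by W (4n x 2n), giving four n-dimensional vectors i, f, o, g]
LSTM:
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W \begin{pmatrix} x \\ h_{t-1} \end{pmatrix}
c_t = f \odot c_{t-1} + i \odot g
h_t = o \odot \tanh(c_t)
RNN, in the same notation:
h_t = \tanh\left( W \begin{pmatrix} x \\ h_{t-1} \end{pmatrix} \right), that is, h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x)
[Figure: cell-state path from t-1 to t: c_{t-1} is multiplied by the forget gate f (0/1), the incoming value g scaled by the input gate i (0/1) is added, and the result passes through tanh and is multiplied by the output gate o to give h_t]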
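The LSTM equations can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `sigmoid` and `lstm_step`, the size n = 3, and the random weights are hypothetical, and biases are omitted because the slide equations have none.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step. x, h_prev, c_prev have length n; W has shape (4n, 2n)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev])   # stacked 2n vector -> 4n pre-activations
    i = sigmoid(z[0:n])                   # input gate
    f = sigmoid(z[n:2 * n])               # forget gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    g = np.tanh(z[3 * n:4 * n])           # incoming candidate values
    c = f * c_prev + i * g                # c_t = f * c_{t-1} + i * g (elementwise)
    h = o * np.tanh(c)                    # h_t = o * tanh(c_t)
    return h, c

# Hypothetical size and random weights, for illustration only.
n = 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n, 2 * n))
x = rng.normal(size=n)
h, c = lstm_step(x, np.zeros(n), np.zeros(n), W)
print(h, c)
```

Note that the cell state is updated by an elementwise multiply and an add, with no matrix product or squashing function on the path from c_{t-1} to c_t.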
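Flow of gradient
RNN: \frac{\partial L}{\partial x} = w_{hh} \cdots w_{hh} \cdots w_{hh} \cdots w_{hh} = w_{hh}^n \cdot C(w); the gradient is multiplied by w_{hh} at every step from t+1 back to t-1.
LSTM: c_t = f \odot c_{t-1} + i \odot g and h_t = o \odot \tanh(c_t); along the cell state from t+1 back to t-1, the gradient passes only through the additions (+) and the forget gates f.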
Source: https://imgur.com/gallery/vaNahKE
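The difference in gradient flow can be illustrated with scalars. In the sketch below the gate values f, i, g are held constant, which is a simplifying assumption: in a real LSTM the gates depend on the input and on h_{t-1}. Under that assumption the derivative of the final cell state with respect to the initial one is just the product of the forget gates, and it is compared with the w_{hh}^n factor of the vanilla RNN. All names and gate values are hypothetical.

```python
def final_cell_state(c0, f, i, g, steps):
    """Apply c_t = f * c_{t-1} + i * g for a number of steps with constant gates."""
    c = c0
    for _ in range(steps):
        c = f * c + i * g
    return c

steps = 10
f, i, g = 0.95, 0.5, 0.1      # hypothetical constant gate values

# Numerical derivative of the final cell state with respect to the initial one.
eps = 1e-6
numeric = (final_cell_state(1.0 + eps, f, i, g, steps)
           - final_cell_state(1.0 - eps, f, i, g, steps)) / (2 * eps)

lstm_factor = f ** steps      # product of forget gates along the cell state
rnn_factor = 0.024 ** steps   # product of w_hh factors in the vanilla RNN example

print(numeric, lstm_factor, rnn_factor)
```

With a forget gate close to 1, the cell-state path keeps a usable gradient over ten steps, while the vanilla RNN factor from the worked example has effectively vanished.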
Long Short-Term Memory (LSTM)
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Reference
1. Long Short-Term Memory (Hochreiter and Schmidhuber, 1997), http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
2. Learning Long Term Dependencies With Gradient Descent is Difficult (Yoshua Bengio, 1994), http://www.dsi.unifi.it/~paolo/ps/tnn-94-gradient.pdf
3. http://neuralnetworksanddeeplearning.com/chap5.html
4. Deep Learning, Ian Goodfellow et al., The MIT Press
5. Recurrent Neural Networks, LSTM, Andrej Karpathy, Stanford Lectures, https://www.youtube.com/watch?v=iX5V1WpxxkY
Alex Kalinin