Recurrent Neural Networks
Deep Learning, Lecture 5
Efstratios Gavves
Sequential Data
So far, all tasks assumed stationary data
However, not all data and not all tasks are stationary
Sequential Data: Text
What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?
Memory needed
$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$
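To make the factorization concrete, here is a tiny numpy sketch; the uniform `cond_prob` model and all names are illustrative placeholders, not part of the lecture:

```python
# Pr(x) = prod_i Pr(x_i | x_1, ..., x_{i-1}), usually computed in log space.
import numpy as np

def sequence_log_prob(tokens, cond_prob):
    """cond_prob(history, token) -> Pr(token | history)."""
    log_p = 0.0
    for i, tok in enumerate(tokens):
        log_p += np.log(cond_prob(tokens[:i], tok))
    return log_p

# Hypothetical uniform "model" over a 10-word vocabulary.
uniform = lambda history, tok: 1.0 / 10
print(sequence_log_prob(["what", "about", "inputs"], uniform))  # 3 * log(0.1)
```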
Sequential data: Video
Quiz: Other sequential data?
Time series data
Stock exchange
Biological measurements
Climate measurements
Market analysis
Speech/Music
User behavior in websites
...
Applications
Machine Translation
The phrase in the source language is one sequence
– "Today the weather is good"
The phrase in the target language is also a sequence
– "Погода сегодня хорошая"
[Diagram: an encoder chain of LSTMs reads "Today the weather is good <EOS>"; a decoder chain of LSTMs then emits "Погода сегодня хорошая <EOS>" word by word (Encoder / Decoder)]
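A schematic numpy sketch of the encoder-decoder idea; a plain RNN step stands in for the LSTM cells of the figure, and the encoder and decoder share parameters only for brevity (all names and shapes are assumptions):

```python
import numpy as np

def rnn_step(x, state, U, W):
    return np.tanh(U @ x + W @ state)

def translate(source_vectors, U, W, V, embed, eos_id, max_len=20):
    # Encoder: read the whole source sequence into one final state.
    state = np.zeros(W.shape[0])
    for x in source_vectors:                 # "Today the weather is good <EOS>"
        state = rnn_step(x, state, U, W)
    # Decoder: from that state, emit target words until <EOS>.
    output, word = [], eos_id
    for _ in range(max_len):
        state = rnn_step(embed[word], state, U, W)
        word = int(np.argmax(V @ state))     # greedy choice of the next word
        if word == eos_id:
            break
        output.append(word)
    return output                            # e.g. "Погода сегодня хорошая"
```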
Image captioning
An image is a thousand words, literally!
Pretty much the same as machine translation
– Replace the encoder part with the output of a ConvNet
– E.g. use AlexNet or a VGG16 network
– Keep the decoder part to operate as a translator
[Diagram: a ConvNet encodes the image; a chain of decoder LSTMs then emits the caption, e.g. "Today the weather is good <EOS>"]
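The same decoder, with the text encoder swapped for ConvNet features; a hedged sketch where `image_features`, `W_img` and the greedy decoding loop are illustrative stand-ins:

```python
import numpy as np

def caption(image_features, W_img, U, W, V, embed, eos_id, max_len=20):
    # Initialize the decoder state from the ConvNet output (e.g. a VGG16 feature vector).
    state = np.tanh(W_img @ image_features)
    words, word = [], eos_id
    for _ in range(max_len):
        state = np.tanh(U @ embed[word] + W @ state)  # decoder RNN step
        word = int(np.argmax(V @ state))              # greedy next word
        if word == eos_id:
            break
        words.append(word)
    return words
```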
Question answering
Bleeding-edge research, no real consensus
– Very interesting open research problems
Again, pretty much like machine translation
Again, the Encoder-Decoder paradigm
– Insert the question into the encoder part
– Model the answer at the decoder part
Question answering with images also
– Again, bleeding-edge research
– How/where to add the image?
– What has been working so far is to add the image only at the beginning
Q: What are the people playing?
A: They play beach football
Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.
Recurrent Networks
[Diagram: a neural network cell with an input, an output, an internal state, and recurrent connections feeding the state back into the cell]
Sequences
Next data depend on previous data
Roughly equivalent to predicting what comes next
$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$
Why sequences?
Parameters are reused → fewer parameters → easier modelling
Generalizes well to arbitrary lengths → not constrained to a specific length
However, often we pick a "frame" $T$ instead of an arbitrary length
$\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$
"I think, therefore, I am!"
"Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter."
The same sequential model applies to either quote, short or long: $sequentialize(\text{quote 1}) \equiv sequentialize(\text{quote 2})$
Quiz: What is really a sequence?
Data inside a sequence are … ?
"I am Bond, James ___" → Bond? McGuire? tired? am? !?
Quiz: What is really a sequence?
Data inside a sequence are non-i.i.d.
– i.i.d.: independently and identically distributed
The next "word" depends on the previous "words"
– Ideally on all of them
We need context, and we need memory!
How to model context and memory?
"I am Bond, James ___" → Bond
One-hot vectors
$x_i \equiv$ one-hot vector: a vector with all zeros except for the active dimension
12 words in a sequence → 12 one-hot vectors
After the one-hot vectors, apply an embedding
– Word2Vec, GloVe
Vocabulary: I, am, Bond, ",", James, McGuire, tired, !
[Figure: each word of "I am Bond, James McGuire tired!" mapped to its one-hot column over this vocabulary, e.g. "tired" → (0, 0, 0, 0, 0, 0, 1, 0)]
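A small numpy illustration of one-hot encoding and an embedding lookup; the random embedding matrix is only a placeholder for Word2Vec/GloVe vectors:

```python
import numpy as np

vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

print(one_hot("tired"))            # [0. 0. 0. 0. 0. 0. 1. 0.]

# An embedding is just a matrix E; multiplying by a one-hot vector selects a row
# (in practice the rows come from Word2Vec or GloVe rather than random numbers).
E = np.random.default_rng(0).normal(size=(len(vocab), 3))
print(E.T @ one_hot("tired"))      # same as E[word_to_idx["tired"]]
```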
Quiz: Why not just indices?
"I am James McGuire"
One-hot representation:
$x_{\text{"I"}} = (1, 0, 0, 0, 0, 0, 0, 0)$, $x_{\text{"am"}} = (0, 1, 0, 0, 0, 0, 0, 0)$, $x_{\text{"James"}} = (0, 0, 0, 1, 0, 0, 0, 0)$, $x_{\text{"McGuire"}} = (0, 0, 0, 0, 0, 0, 1, 0)$
OR index representation:
$x_{\text{"I"}} = 1$, $x_{\text{"am"}} = 2$, $x_{\text{"James"}} = 4$, $x_{\text{"McGuire"}} = 7$
Quiz: Why not just indices?
No, because then some words get closer together for no good reason (an artificial bias between words)
"I am James McGuire"
Index representation:
$i_{\text{"I"}} = 1$, $i_{\text{"am"}} = 2$, $i_{\text{"James"}} = 4$, $i_{\text{"McGuire"}} = 7$
$distance(i_{\text{"I"}}, i_{\text{"am"}}) = 1$, but $distance(i_{\text{"am"}}, i_{\text{"McGuire"}}) = 5$
One-hot representation:
$x_{\text{"I"}}, x_{\text{"am"}}, x_{\text{"James"}}, x_{\text{"McGuire"}}$
$distance(x_{\text{"I"}}, x_{\text{"am"}}) = distance(x_{\text{"am"}}, x_{\text{"McGuire"}}) = 1$: all words are equally far apart
Memory
A representation of the past
A memory must project the information at timestep $t$ onto a latent space $m_t$ using parameters $\theta$
Then, re-use the projected information from $t$ at $t+1$:
$m_{t+1} = h(x_{t+1}, m_t; \theta)$
Memory parameters $\theta$ are shared for all timesteps $t = 0, 1, \dots$:
$m_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \dots h(x_1, m_0; \theta); \theta); \theta); \theta)$
Memory as a Graph
Simplest model
– Input with parameters $U$
– Memory embedding with parameters $W$
– Output with parameters $V$
[Diagram, progressively unrolled: at each timestep the input $x_t$ enters through $U$, the memory $m_t$ is carried to $m_{t+1}$ through $W$, and the output $y_t$ is produced through $V$; the same $U$, $W$, $V$ are reused at $t, t+1, t+2, t+3$]
Folding the memory
[Diagram: left, the unrolled/unfolded network with inputs $x_t, x_{t+1}, x_{t+2}$, memories $m_t, m_{t+1}, m_{t+2}$ and outputs $y_t, y_{t+1}, y_{t+2}$; right, the folded network: a single cell with input $x_t$, output $y_t$, and a recurrent edge $W$ from $m_{t-1}$ back into $m_t$]
Recurrent Neural Network (RNN)
Only two equations:
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$
RNN Example
Vocabulary of 5 words
A memory of 3 units
– Hyperparameter that we choose, like layer size
– $c_t$: $[3 \times 1]$, $W$: $[3 \times 3]$
An input projection of 3 dimensions
– $U$: $[3 \times 5]$
An output projection of 10 dimensions
– $V$: $[10 \times 3]$
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$
With a one-hot input, e.g. $x_t = (0, 0, 0, 1, 0)^\top$ for word 4, the product $U \cdot x_t$ simply selects the corresponding column of $U$
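The same example in numpy; the weights are random placeholders and only the shapes follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))    # input projection
W = rng.normal(size=(3, 3))    # memory-to-memory
V = rng.normal(size=(10, 3))   # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, c_prev):
    c_t = np.tanh(U @ x_t + W @ c_prev)
    y_t = softmax(V @ c_t)
    return c_t, y_t

# One-hot input for word 4: U @ x_t just picks out the 4th column of U.
x_t = np.zeros(5); x_t[3] = 1.0
c, y = rnn_step(x_t, np.zeros(3))
print(np.allclose(U @ x_t, U[:, 3]))   # True
```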
Rolled Network vs. MLP?
What is really different?
– Steps instead of layers
– Step parameters $W$ are shared in the Recurrent Network
– In a Multi-Layer Network the parameters of each layer are different
[Diagram: a 3-gram unrolled recurrent network ("Layer/Step" 1, 2, 3 sharing $U$, $W$, $V$) next to a 3-layer neural network (Layer 1, 2, 3 with distinct parameters $W_1$, $W_2$, $W_3$), both ending in a final output $y$]

Rolled Network vs. MLP?
– Sometimes the intermediate outputs are not even needed
– Removing them, we almost end up with a standard Neural Network
[Diagram: the same unrolled recurrent network with the intermediate outputs removed, keeping only the final output $y$]
Training Recurrent Networks
Cross-entropy loss:
$P = \prod_{t,k} y_{t,k}^{l_{t,k}} \;\Rightarrow\; \mathcal{L} = -\log P = \sum_t \ell_t = -\frac{1}{T} \sum_t l_t \log y_t$
Backpropagation Through Time (BPTT)
– Again, the chain rule
– Only difference: gradients survive over time steps
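A minimal BPTT sketch for the two RNN equations above, assuming column-vector states, one-hot input arrays, and integer targets; an illustrative reconstruction, not the lecture's reference code:

```python
# c_t = tanh(U x_t + W c_{t-1}), y_t = softmax(V c_t),
# cross-entropy summed over timesteps (the 1/T normalization is omitted).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(xs, targets, U, W, V, c0):
    """xs: list of one-hot inputs, targets: list of target word indices."""
    # Forward pass, storing states for the backward pass.
    cs, ys, loss = [c0], [], 0.0
    for x, t in zip(xs, targets):
        c = np.tanh(U @ x + W @ cs[-1])
        y = softmax(V @ c)
        cs.append(c); ys.append(y)
        loss -= np.log(y[t])
    # Backward pass: gradients are summed over timesteps, and the error signal
    # dc is carried backwards through W (this is where it can vanish or explode).
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dc_next = np.zeros_like(c0)
    for step in reversed(range(len(xs))):
        dy = ys[step].copy(); dy[targets[step]] -= 1.0   # d loss / d (V c_t)
        dV += np.outer(dy, cs[step + 1])
        dc = V.T @ dy + dc_next                          # error reaching c_t
        dpre = (1.0 - cs[step + 1] ** 2) * dc            # through tanh
        dU += np.outer(dpre, xs[step])
        dW += np.outer(dpre, cs[step])
        dc_next = W.T @ dpre                             # pass error to c_{t-1}
    return loss, dU, dW, dV
```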
Training RNNs is hard
Vanishing gradients
– After a few time steps the gradients become almost 0
Exploding gradients
– After a few time steps the gradients become huge
Can't capture long-term dependencies
Alternative formulation for RNNs
An alternative formulation to derive conclusions and intuitions:
$c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
$\mathcal{L} = \sum_t \ell_t(c_t)$
Another look at the gradients
$\mathcal{L} = \ell(c_T(c_{T-1}(\dots c_1(x_1, c_0; \theta); \theta \dots); \theta))$
$\dfrac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \ell_t}{\partial c_t} \dfrac{\partial c_t}{\partial c_\tau} \dfrac{\partial c_\tau}{\partial \theta}$
$\dfrac{\partial \ell}{\partial c_\tau} = \dfrac{\partial \ell}{\partial c_t} \cdot \dfrac{\partial c_t}{\partial c_{t-1}} \cdot \dfrac{\partial c_{t-1}}{\partial c_{t-2}} \cdot \dots \cdot \dfrac{\partial c_{\tau+1}}{\partial c_\tau} \le \eta^{\,t-\tau} \dfrac{\partial \ell_t}{\partial c_t}$
The RNN gradient is a recursive product of $\dfrac{\partial c_t}{\partial c_{t-1}}$
$\tau$ close to $t$ → short-term factors; $t \gg \tau$ → long-term factors
Exploding/Vanishing gradients
$\dfrac{\partial \mathcal{L}}{\partial c_\tau} = \dfrac{\partial \mathcal{L}}{\partial c_T} \cdot \dfrac{\partial c_T}{\partial c_{T-1}} \cdot \dfrac{\partial c_{T-1}}{\partial c_{T-2}} \cdot \dots \cdot \dfrac{\partial c_{\tau+1}}{\partial c_\tau}$
If each factor is $< 1$: $\dfrac{\partial \mathcal{L}}{\partial c_\tau} \ll 1 \Rightarrow$ vanishing gradient
If each factor is $> 1$: $\dfrac{\partial \mathcal{L}}{\partial c_\tau} \gg 1 \Rightarrow$ exploding gradient
Vanishing gradients
The gradient of the error w.r.t. an intermediate cell:
$\dfrac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \ell_t}{\partial y_t} \dfrac{\partial y_t}{\partial c_t} \dfrac{\partial c_t}{\partial c_\tau} \dfrac{\partial c_\tau}{\partial \theta}$
$\dfrac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \dfrac{\partial c_k}{\partial c_{k-1}} = \prod_{t \ge k \ge \tau} W \cdot \partial \tanh(c_{k-1})$
Long-term dependencies get exponentially smaller weights
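A quick numerical illustration of that product (toy values, assuming the alternative formulation $c_t = W\tanh(c_{t-1}) + Ux_t + b$): with a small-norm $W$, the norm of $\partial c_t / \partial c_0$ shrinks exponentially with the number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(scale=0.5, size=(n, n))   # recurrent weights with small norm
c = rng.normal(size=n)                    # some state
x = rng.normal(size=(20, n))              # 20 random "inputs" (already projected)

jacobian_product = np.eye(n)
for t in range(20):
    # d c_t / d c_{t-1} = W diag(tanh'(c_{t-1})) for c_t = W tanh(c_{t-1}) + U x_t + b
    step_jacobian = W @ np.diag(1.0 - np.tanh(c) ** 2)
    jacobian_product = step_jacobian @ jacobian_product
    c = W @ np.tanh(c) + x[t]
    if t % 5 == 4:
        print(f"after {t+1:2d} steps  ||dc_t/dc_0|| = {np.linalg.norm(jacobian_product):.2e}")
```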
Gradient clipping for exploding gradients
Scale the gradients to a threshold
Pseudocode:
1. $g \leftarrow \dfrac{\partial \mathcal{L}}{\partial \theta}$
2. if $\|g\| > \theta_0$: $g \leftarrow \dfrac{\theta_0}{\|g\|} g$, else: do nothing
Simple, but it works!
[Figure: the gradient vector $g$ is rescaled to lie on the ball of radius $\theta_0$]
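The pseudocode as a small numpy function (names are illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale gradient g so that its norm never exceeds `threshold`."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

# Example: a gradient with norm ~100 is scaled back to norm 5.
g = np.full(4, 50.0)
print(np.linalg.norm(clip_gradient(g, 5.0)))   # -> 5.0
```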
Rescaling vanishing gradients?
Not a good solution
Weights are shared between timesteps → the loss is summed over timesteps:
$\mathcal{L} = \sum_t \ell_t \;\Rightarrow\; \dfrac{\partial \mathcal{L}}{\partial \theta} = \sum_t \dfrac{\partial \ell_t}{\partial \theta}$
$\dfrac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \ell_t}{\partial c_t} \dfrac{\partial c_t}{\partial c_\tau} \dfrac{\partial c_\tau}{\partial \theta}$
Rescaling for one timestep ($\dfrac{\partial \ell_t}{\partial \theta}$) affects all timesteps
– The rescaling factor for one timestep does not work for another
More intuitively
$\dfrac{\partial \mathcal{L}}{\partial \theta} = \dfrac{\partial \mathcal{L}_1}{\partial \theta} + \dfrac{\partial \mathcal{L}_2}{\partial \theta} + \dfrac{\partial \mathcal{L}_3}{\partial \theta} + \dfrac{\partial \mathcal{L}_4}{\partial \theta} + \dfrac{\partial \mathcal{L}_5}{\partial \theta}$
[Figure: the five per-timestep gradient contributions drawn as vectors of very different magnitudes]
More intuitively
Let's say $\dfrac{\partial \mathcal{L}_1}{\partial \theta} \propto 1$, $\dfrac{\partial \mathcal{L}_2}{\partial \theta} \propto 1/10$, $\dfrac{\partial \mathcal{L}_3}{\partial \theta} \propto 1/100$, $\dfrac{\partial \mathcal{L}_4}{\partial \theta} \propto 1/1000$, $\dfrac{\partial \mathcal{L}_5}{\partial \theta} \propto 1/10000$
$\dfrac{\partial \mathcal{L}}{\partial \theta} = \sum_\tau \dfrac{\partial \mathcal{L}_\tau}{\partial \theta} = 1.1111$
If $\dfrac{\partial \mathcal{L}}{\partial \theta}$ is rescaled to 1, then $\dfrac{\partial \mathcal{L}_5}{\partial \theta} \propto 10^{-5}$
Longer-term dependencies become negligible
– Weak recurrent modelling
– Learning focuses on the short term only
Recurrent networks – chaotic systems
Fixing vanishing gradients
Regularization on the recurrent weights
– Force the error signal not to vanish
$\Omega = \sum_t \Omega_t = \sum_t \left( \dfrac{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \frac{\partial c_{t+1}}{\partial c_t} \right\|}{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \right\|} - 1 \right)^2$
Advanced recurrent modules
Long Short-Term Memory (LSTM) module
Gated Recurrent Unit (GRU) module
Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013
Advanced Recurrent Nets
[Diagram: the LSTM cell: the input $x_t$ and the previous output $m_{t-1}$ feed the gates $i$, $f$, $o$ and a tanh candidate; the cell state $c_{t-1}$ flows to $c_t$ through the forget gate and the $+$ junction; the output $m_t$ is $\tanh(c_t)$ gated by $o$]
How to fix the vanishing gradients?
The error signal over time must have a norm that is neither too large nor too small
Let's have a look at the loss function:
$\dfrac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \ell_t}{\partial y_t} \dfrac{\partial y_t}{\partial c_t} \dfrac{\partial c_t}{\partial c_\tau} \dfrac{\partial c_\tau}{\partial \theta}$, with $\dfrac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \dfrac{\partial c_k}{\partial c_{k-1}}$
Solution: have an activation function with gradient equal to 1
– The identity function has a gradient equal to 1
By doing so, gradients become neither too small nor too large
Long Short-Term Memory (LSTM): a beefed-up RNN
$i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$
$m_t = \tanh(c_t) \odot o$
[LSTM cell diagram with input $x_t$, previous output $m_{t-1}$, previous cell state $c_{t-1}$, gates $i$, $f$, $o$, and outputs $m_t$, $c_t$]
More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
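A minimal numpy sketch of one LSTM step following the equations above (row-vector convention $x_t U + m_{t-1} W$; shapes and initialization are illustrative assumptions, not the lecture's reference code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, params):
    """One LSTM step: returns the new output m_t and cell state c_t."""
    Ui, Wi, Uf, Wf, Uo, Wo, Ug, Wg = params
    i = sigmoid(x_t @ Ui + m_prev @ Wi)        # input gate
    f = sigmoid(x_t @ Uf + m_prev @ Wf)        # forget gate
    o = sigmoid(x_t @ Uo + m_prev @ Wo)        # output gate
    c_tilde = np.tanh(x_t @ Ug + m_prev @ Wg)  # candidate memories
    c_t = c_prev * f + c_tilde * i             # new cell state (no nonlinearity on c_prev)
    m_t = np.tanh(c_t) * o                     # new output / memory
    return m_t, c_t

# Example with a 5-dimensional input and a 3-unit LSTM.
rng = np.random.default_rng(0)
d_in, d_mem = 5, 3
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_in, d_mem), (d_mem, d_mem)] * 4]
m, c = np.zeros(d_mem), np.zeros(d_mem)
for x in rng.normal(size=(4, d_in)):           # a short sequence of 4 inputs
    m, c = lstm_step(x, m, c, params)
print(m, c)
```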
Simple RNN vs. LSTM
Simple RNN: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
LSTM:
– The previous state $c_{t-1}$ is connected to the new $c_t$ with no nonlinearity (identity function)
– The only other factor is the forget gate $f$, which rescales the previous LSTM state
Cell state
The cell state carries the essential information over time
[Diagram: the cell state line runs straight across the top of the LSTM cell, modified only by the forget gate and the $+$ junction that adds the gated candidate memories]
LSTM nonlinearities
$\sigma \in (0, 1)$: control gate, something like a switch
$\tanh \in (-1, 1)$: the recurrent nonlinearity
[LSTM cell diagram]
LSTM Step-by-Step: Step (1)
E.g. LSTM on "Yesterday she slapped me. Today she loves me."
Decide what to forget and what to remember for the new memory
– Sigmoid ≈ 1 → remember everything
– Sigmoid ≈ 0 → forget everything
$f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
LSTM Step-by-Step: Step (2)
Decide what new information is relevant from the new input and should be added to the new memory
– Modulate the input: $i_t$
– Generate candidate memories: $\tilde{c}_t$
$i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
LSTM Step-by-Step: Step (3)
Compute and update the current cell state $c_t$
– Depends on the previous cell state
– What we decide to forget
– What inputs we allow
– The candidate memories
$c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$
LSTM Step-by-Step: Step (4)
Modulate the output
– Is the new cell state relevant? → Sigmoid ≈ 1
– If not → Sigmoid ≈ 0
Generate the new memory
$o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$m_t = \tanh(c_t) \odot o_t$
LSTM Unrolled Network
Macroscopically very similar to standard RNNs
The engine is a bit different (more complicated)
– Because of their gates, LSTMs capture long- and short-term dependencies
[Diagram: three LSTM cells unrolled over time, with the cell state flowing along the top from cell to cell]
Beyond RNN & LSTM
LSTM with peephole connections
– Gates also have access to the previous cell state $c_{t-1}$ (not only the memories)
– Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$
Bi-directional recurrent networks
Gated Recurrent Units (GRU)
Deep recurrent architectures
Recursive neural networks
– Tree-structured
Multiplicative interactions
Generative recurrent architectures
[Diagram: a two-layer deep recurrent architecture, LSTM (1) feeding LSTM (2) at each timestep]
Take-away message
Recurrent Neural Networks (RNN) for sequences
Backpropagation Through Time
Vanishing and Exploding Gradients and Remedies
RNNs using Long Short-Term Memory (LSTM)
Applications of Recurrent Neural Networks
Thank you!