Recurrent Neural Networks
Instructor: Yoav Artzi
CS5740: Natural Language Processing
Adapted from Yoav Goldberg’s Book and slides by Sasha Rush
Overview
• Finite state models
• Recurrent neural networks (RNNs)
• Training RNNs
• RNN models
• Long short-term memory (LSTM)
• Attention
Text Classification
• Consider the example:
– Goal: classify sentiment
How can you not see this movie?
You should not see this movie.
• Model: bag of words
• How well will the classifier work?
– Similar unigrams and bigrams
• Generally: need to maintain a state to capture distant influences
Finite State Machines
• Simple, classical way of representing state
• Current state: saves necessary past information
• Example: email address parsing
Deterministic Finite State Machines
• S – set of states
• Σ – vocabulary
• s_0 ∈ S – start state
• R: S × Σ → S – transition function
• What does it do?
– Maps input w_1, …, w_n to states s_1, …, s_n
– For all i ∈ 1, …, n: s_i = R(s_{i-1}, w_i)
• Can we use it for POS tagging? Language modeling?
Types of State Machines
• Acceptor
– Compute the final state s_n and make a decision based on it: y = O(s_n)
• Transducer
– Apply a function y_i = O(s_i) to produce an output for each intermediate state
• Encoder
– Compute the final state s_n, and use it in another model
Recurrent Neural Networks
• Motivation:
– Neural network model, but with state
– How can we borrow ideas from FSMs?
• RNNs are FSMs …
– … with a twist
– No longer finite in the same sense
RNN
• S = ℝ^{d_hid} – hidden state space
• Σ = ℝ^{d_in} – input space
• s_0 ∈ S – initial state vector
• R : ℝ^{d_in} × ℝ^{d_hid} → ℝ^{d_hid} – transition function
• Simple definition of R:
  R_simple(s, x) = tanh([x; s] W + b)
Elman (1990)
* Notation: vectors and matrices are bold
RNN
• Map from a dense sequence to dense representations
– x_1, …, x_n → s_1, …, s_n
– For all i ∈ 1, …, n: s_i = R(s_{i-1}, x_i)
– R is parameterized, and its parameters are shared between all steps
– Example:
  s_4 = R(s_3, x_4) = ⋯ = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
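The shared-parameter recurrence above can be sketched in NumPy. The dimensions, random weights, and helper names here are illustrative assumptions, not from the slides:

```python
import numpy as np

def rnn_step(s_prev, x, W, b):
    """Elman transition: R(s, x) = tanh([x; s] W + b)."""
    return np.tanh(np.concatenate([x, s_prev]) @ W + b)

def run_rnn(xs, s0, W, b):
    """Map x_1..x_n to states s_1..s_n, reusing the same W, b at every step."""
    states, s = [], s0
    for x in xs:
        s = rnn_step(s, x, W, b)
        states.append(s)
    return states

# Illustrative sizes and random parameters (hypothetical, for the sketch only).
d_in, d_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_in + d_hid, d_hid))
b = np.zeros(d_hid)
xs = [rng.normal(size=d_in) for _ in range(5)]
states = run_rnn(xs, np.zeros(d_hid), W, b)
```

Note that the same `W`, `b` are applied at every step, which is exactly the parameter sharing the slide describes.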
RNNs
• Hidden states s_i can be used in different ways
• Similar to finite state machines:
– Acceptor
– Transducer
– Encoder
• Output function maps state vectors to outputs: O: ℝ^{d_hid} → ℝ^{d_out}
• For example, a single layer + softmax:
  O(s_n) = softmax(s_n W + b)
Graphical Representation
Recursive representation vs. unrolled representation
Training
• RNNs are trained with SGD and backpropagation
• Define a loss over the outputs
– Depends on supervision and task
• Backpropagation through time (BPTT)
– Use the unrolled representation
– Run forward propagation
– Run backward propagation
– Update all weights
• Weights are shared between time steps
– Sum the contributions of each time step to the gradient
• Inefficient
– Batching helps; it is common, but tricky to implement with variable-size models (good helper methods in PyTorch, a non-issue with auto-batching in DyNet)
RNN: Acceptor Architecture
• Only care about the output from the last hidden state
• Train: supervised, with a loss on the prediction
• Example:
– Text classification
Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Bi-gram decomposition:
  p(X) = ∏_{i=1}^{n} p(x_i ∣ x_{i-1})
• With RNNs, we can do non-Markovian models:
  p(X) = ∏_{i=1}^{n} p(x_i ∣ x_1, …, x_{i-1})
RNN: Transducer Architecture
• Predict output for every time step
Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Model:
  p(X) = ∏_{i=1}^{n} p(x_i ∣ x_1, …, x_{i-1})
  p(x_i ∣ x_1, …, x_{i-1}) = O(s_i) = O(R(s_{i-1}, x_{i-1}))
  O(s_i) = softmax(s_i W + b)
• Predict the next token ŷ_i as we go: ŷ_i = argmax O(s_i)
RNN: Transducer Architecture
• Predict output for every time step
• Examples:
– Language modeling
– POS tagging
– NER
RNN: Transducer Architecture
  X = x_1, …, x_n
  s_i = R(s_{i-1}, x_i),  i = 1, …, n
  O(s_i) = softmax(s_i W + b)
  y_i = argmax O(s_i)
RNN: Encoder Architecture
• Similar to the acceptor
• Difference: the last state is used as input to another model, not for prediction
  O(s_n) = s_n → y_n = s_n
• Example:
– Sentence embedding
Bidirectional RNNs
• RNN decisions are based on historical data only
– How can we account for future input?
• When is it relevant? Feasible?
• When all the input is available; not for real-time input
• Probabilistic model, for example for language modeling:
  p(X) = ∏_{i=1}^{n} p(x_i ∣ x_1, …, x_{i-1}, x_{i+1}, …, x_n)
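A bidirectional RNN can be sketched as two ordinary RNNs, one run left-to-right and one right-to-left, with their states concatenated per position. Dimensions and random parameters below are illustrative assumptions:

```python
import numpy as np

def rnn(xs, s0, W, b):
    # Plain Elman recurrence; parameters are shared across steps.
    states, s = [], s0
    for x in xs:
        s = np.tanh(np.concatenate([x, s]) @ W + b)
        states.append(s)
    return states

def birnn(xs, s0, Wf, bf, Wb, bb):
    """State at position i sees x_1..x_i (forward run) and x_i..x_n (backward run)."""
    fwd = rnn(xs, s0, Wf, bf)
    bwd = rnn(xs[::-1], s0, Wb, bb)[::-1]  # reverse input, then re-align outputs
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Illustrative sizes and random parameters (hypothetical).
d_in, d_hid = 3, 4
rng = np.random.default_rng(1)
Wf = rng.normal(size=(d_in + d_hid, d_hid)); bf = np.zeros(d_hid)
Wb = rng.normal(size=(d_in + d_hid, d_hid)); bb = np.zeros(d_hid)
xs = [rng.normal(size=d_in) for _ in range(6)]
states = birnn(xs, np.zeros(d_hid), Wf, bf, Wb, bb)
```

Each output state has dimension 2·d_hid, since it combines the forward and backward views of the sequence.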
Deep RNNs
• Can also make RNNs deeper (vertically) to increase model capacity
RNN: Generator
• Special case of the transducer architecture
• Generation conditioned on s_0
• Probabilistic model:
  p(X ∣ s_0) = ∏_{i=1}^{n} p(x_i ∣ x_1, …, x_{i-1}, s_0)
RNN: Generator
• Stop when generating the STOP token
• During learning (usually): force predicting the annotated token and compute the loss
  s_j = R(s_{j-1}, E(t_{j-1}))
  O(s_j) = softmax(s_j W + b)
  t_j = argmax O(s_j)
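A greedy generation loop following these equations can be sketched as below. The embedding table, weights, and the convention that token 0 is the start symbol and the last token is STOP are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(s0, E, W, b, Wt, bt, stop_id, max_len=20):
    """Greedy generation: feed back the embedding of the predicted token t_{j-1}."""
    s, t, out = s0, 0, []                                  # token 0 acts as <s>
    for _ in range(max_len):
        s = np.tanh(np.concatenate([E[t], s]) @ Wt + bt)   # s_j = R(s_{j-1}, E(t_{j-1}))
        t = int(np.argmax(softmax(s @ W + b)))             # t_j = argmax O(s_j)
        if t == stop_id:                                   # stop on the STOP token
            break
        out.append(t)
    return out

# Illustrative vocabulary and random parameters (hypothetical).
vocab, d_emb, d_hid = 8, 3, 4
rng = np.random.default_rng(2)
E = rng.normal(size=(vocab, d_emb))
W = rng.normal(size=(d_hid, vocab)); b = np.zeros(vocab)
Wt = rng.normal(size=(d_emb + d_hid, d_hid)); bt = np.zeros(d_hid)
tokens = generate(np.zeros(d_hid), E, W, b, Wt, bt, stop_id=vocab - 1)
```

During training, per the slide, one would instead feed the annotated token at each step (teacher forcing) and accumulate the loss on O(s_j).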
Example: Caption Generation
• Given: image I
• Goal: generate a caption
• Set s_0 = CNN(I)
• Model:
  p(X ∣ I) = ∏_{i=1}^{n} p(x_i ∣ x_1, …, x_{i-1}, I)
Examples from Karpathy and Fei-Fei (2015)
Sequence-to-Sequence
• Connect encoder and generator
• Many alternatives:
– Set the generator's s_0^D to the encoder output s_n^E
– Concatenate s_n^E with each step's input during generation
• Examples:
– Machine translation
– Chatbots
– Dialog systems
• Can also generate other sequences – not only natural language!
Sequence-to-Sequence
  X = x_1, …, x_n
  s_i^E = R^E(s_{i-1}^E, x_i),  i = 1, …, n
  c = O^E(s_n^E)
  s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c])
  O^D(s_j^D) = softmax(s_j^D W + b)
  t_j = argmax O^D(s_j^D)
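The encoder-decoder equations above can be sketched end to end: encode the input into c, then decode greedily while concatenating c with each step's input. Sizes, random weights, and the identity output function O^E are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman(s, x, W, b):
    return np.tanh(np.concatenate([x, s]) @ W + b)

def seq2seq_decode(xs, p, stop_id, max_len=10):
    """Encode x_1..x_n, then decode greedily, feeding [E(t_{j-1}); c] at each step."""
    s = np.zeros(p["d_hid"])
    for x in xs:                                   # s_i^E = R^E(s_{i-1}^E, x_i)
        s = elman(s, x, p["WE"], p["bE"])
    c = s                                          # c = O^E(s_n^E), identity here
    s, t, out = np.zeros(p["d_hid"]), 0, []        # token 0 acts as <s>
    for _ in range(max_len):
        s = elman(s, np.concatenate([p["E"][t], c]), p["WD"], p["bD"])
        t = int(np.argmax(softmax(s @ p["WO"] + p["bO"])))
        if t == stop_id:
            break
        out.append(t)
    return out

# Illustrative sizes and random parameters (hypothetical).
vocab, d_in, d_emb, d_hid = 6, 3, 2, 4
rng = np.random.default_rng(3)
p = {"d_hid": d_hid,
     "WE": rng.normal(size=(d_in + d_hid, d_hid)), "bE": np.zeros(d_hid),
     "WD": rng.normal(size=(d_emb + d_hid + d_hid, d_hid)), "bD": np.zeros(d_hid),
     "WO": rng.normal(size=(d_hid, vocab)), "bO": np.zeros(vocab),
     "E": rng.normal(size=(vocab, d_emb))}
xs = [rng.normal(size=d_in) for _ in range(4)]
tokens = seq2seq_decode(xs, p, stop_id=vocab - 1)
```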
Sequence-to-Sequence Training Graph
Long-range Interactions
• Promise: learn long-range interactions of language from data
• Example:
How can you not see this movie?
You should not see this movie.
• Sometimes: requires "remembering" an early state
– The key signal here is at s_1, but the gradient is at s_n
Long-term Gradients
• Gradients go through (many) multiplications
• OK at the end layers → close to the loss
• But: an issue for early layers
• For example, the derivative of tanh:
  d/dx tanh(x) = 1 − tanh²(x)
– Large activations → the gradient disappears (vanishes)
• With other activation functions, values can instead grow larger and larger (explode)
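A small numeric illustration of the vanishing effect: the local tanh derivative is at most 1 and near 0 when the unit saturates, so the product over many time steps collapses. The pre-activation value 2.5 and the chain length 20 are arbitrary choices for the sketch:

```python
import numpy as np

def dtanh(x):
    # d/dx tanh(x) = 1 - tanh(x)^2: at most 1, near 0 when |x| is large
    return 1.0 - np.tanh(x) ** 2

# Gradient magnitude through 20 saturated tanh units is (roughly) the
# product of the local derivatives along the unrolled chain.
pre_activations = np.full(20, 2.5)   # large (saturated) pre-activations
grad = np.prod(dtanh(pre_activations))
print(grad)                          # vanishes toward 0
```

Each local derivative here is about 0.027, so twenty multiplications drive the long-term gradient to a vanishingly small number, while the short-term gradient (one or two factors) remains usable.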
Exploding Gradients
• Common when there is no saturation in the activation (e.g., ReLU) and we get exponential blowup
• Result: reasonable short-term gradients, but bad long-term ones
• Common heuristic:
– Gradient clipping: bound all gradients by a maximum value
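The clipping heuristic can be sketched as rescaling the whole gradient when its global norm exceeds a threshold (the variant PyTorch's `clip_grad_norm_` implements); the values below are made up for the sketch:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale the full gradient so its global L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads], total
    return grads, total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_norm(grads, max_norm=5.0)
```

Rescaling by the global norm preserves the gradient's direction while bounding its magnitude, which is why it is preferred over clipping each entry independently.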
Vanishing Gradients
• Occurs when multiplying small values– For example: when tanh saturates
• Mainly affects long-term gradients• Solving this is more complex
Long Short-term Memory (LSTM)
Hochreiter and Schmidhuber (1997)https://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM vs. Elman RNN
LSTM
Image by Tim Rocktäschel
[Figure: LSTM cell – cell state, input, and output paths]
  f_t = σ(W_f [h_{t-1}, x_t] + b_f)
  i_t = σ(W_i [h_{t-1}, x_t] + b_i)
  c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c [h_{t-1}, x_t] + b_c)
  o_t = σ(W_o [h_{t-1}, x_t] + b_o)
  h_t = o_t ∘ tanh(c_t)
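The gate equations above translate directly into a per-step function. Dimensions and random parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h, c, x, p):
    """One LSTM step following the gate equations (NumPy sketch)."""
    hx = np.concatenate([h, x])
    f = sigmoid(hx @ p["Wf"] + p["bf"])                 # forget gate
    i = sigmoid(hx @ p["Wi"] + p["bi"])                 # input gate
    c = f * c + i * np.tanh(hx @ p["Wc"] + p["bc"])     # new cell state
    o = sigmoid(hx @ p["Wo"] + p["bo"])                 # output gate
    h = o * np.tanh(c)                                  # new hidden state
    return h, c

# Illustrative sizes and random parameters (hypothetical).
d_in, d_hid = 3, 4
rng = np.random.default_rng(4)
p = {k: rng.normal(size=(d_hid + d_in, d_hid)) for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: np.zeros(d_hid) for k in ("bf", "bi", "bc", "bo")})
h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):
    h, c = lstm_step(h, c, rng.normal(size=d_in), p)
```

The key design point is the additive update of c_t: the cell state is copied forward through a gated sum rather than repeatedly squashed, which is what mitigates vanishing gradients.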
Attention
• In seq-to-seq models, a single vector connects encoding and decoding
– Any concern?
– All the information in the input string must be encoded into a fixed-length vector
– The decoder must recover all this information from a fixed-length vector
• Attention relaxes the assumption that a single vector must be used to encode the input sentence, regardless of its length
Attention
• Encode the input sentence as a sequence of vectors
• At each step: pick which vector to use
• But: a discrete choice is not differentiable
– Make the choice soft
Attention
[Figure: sequence-to-sequence RNN generator with attention – each decoder state s_0, …, s_4 attends over the BiRNN-encoded input, concatenates the resulting context c_j with the step input, and predicts y_j.]
That is, they represent a window focused around the input item x_i and not the item itself. Second, by having a trainable encoding component that is trained jointly with the decoder, the encoder and decoder evolve together, and the network can learn to encode relevant properties of the input that are useful for decoding, and that may not be present at the source sequence x_{1:n} directly. For example, the biRNN encoder may learn to encode the position of x_i within the sequence, and the decoder could use this information to access the elements in order, or learn to pay more attention to elements in the beginning of the sequence than to elements at its end.
Attend
  X = x_1, …, x_n
  s_i^E = R^E(s_{i-1}^E, x_i),  i = 1, …, n
  c_i = O^E(s_i^E)
  α_j^i = s_{j-1}^D · c_i
  α_j = softmax(α_j^1, …, α_j^n)
  c^j = Σ_{i=1}^{n} α_j^i c_i
  s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c^j])
  O^D(s_j^D) = softmax(s_j^D W + b)
  t_j = argmax O^D(s_j^D)
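The attend step itself (dot-product scores, softmax weights, weighted sum) can be sketched in isolation; the vector sizes and random values are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_dec, cs):
    """Dot-product attention over encoder vectors c_1..c_n."""
    scores = np.array([s_dec @ c for c in cs])       # alpha_j^i = s_{j-1}^D . c_i
    alpha = softmax(scores)                          # soft, differentiable choice
    context = sum(a * c for a, c in zip(alpha, cs))  # c^j = sum_i alpha_j^i c_i
    return context, alpha

# Illustrative encoder outputs and decoder state (hypothetical values).
rng = np.random.default_rng(5)
cs = [rng.normal(size=4) for _ in range(6)]
context, alpha = attend(rng.normal(size=4), cs)
```

Because the softmax weights sum to one, the context vector is a convex combination of the encoder vectors, and gradients flow to every input position.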
Attention
• Many variants of the attention function:
– Dot product (previous slide)
– MLP
– Bi-linear transformation
• Various ways to combine the context vector into the decoder computation
• See Luong et al. (2015)