-
NPFL114, Lecture 9
Recurrent Neural Networks III
Milan Straka
April 29, 2019
Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics
-
Recurrent Neural Networks
Diagram: a single RNN cell with input, state and output, and the same cell unrolled over four time steps (input 1–4, output 1–4), passing the state from step to step.
-
Basic RNN Cell
Diagram: a cell taking the input and the previous state and producing the output, which equals the new state.

Given an input $x^{(t)}$ and previous state $s^{(t-1)}$, the new state is computed as
$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta).$$

One of the simplest possibilities is
$$s^{(t)} = \tanh(U s^{(t-1)} + V x^{(t)} + b).$$
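A minimal NumPy sketch of this basic cell; the dimensions and the random initialization are purely illustrative.

```python
import numpy as np

def basic_rnn_cell(x, s_prev, U, V, b):
    """One step of the basic RNN cell: s_t = tanh(U s_{t-1} + V x_t + b)."""
    return np.tanh(U @ s_prev + V @ x + b)

# Illustrative dimensions: state size 4, input size 3.
rng = np.random.default_rng(42)
U = rng.normal(scale=0.1, size=(4, 4))
V = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(4)

s = np.zeros(4)                       # initial state
for x in rng.normal(size=(5, 3)):     # a sequence of 5 inputs
    s = basic_rnn_cell(x, s, U, V, b)
print(s)
```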
-
Basic RNN Cell
Basic RNN cells suffer a lot from vanishing/exploding gradients (the challenge of long-term dependencies).

If we simplify the recurrence of states to
$$s^{(t)} = U s^{(t-1)},$$
we get
$$s^{(t)} = U^t s^{(0)}.$$

If $U$ has eigenvalue decomposition $U = Q \Lambda Q^{-1}$, we get
$$s^{(t)} = Q \Lambda^t Q^{-1} s^{(0)}.$$

The main problem is that the same function is iteratively applied many times.

Several more complex RNN cell variants have been proposed, which alleviate this issue to some degree, namely LSTM and GRU.
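A small NumPy illustration (not from the slides) of why repeated multiplication by the same matrix makes the state vanish or explode depending on the eigenvalues of $U$:

```python
import numpy as np

rng = np.random.default_rng(0)
s0 = rng.normal(size=2)

# Two diagonal(izable) matrices: eigenvalues below 1 vs. above 1.
U_vanish = np.array([[0.9, 0.0], [0.0, 0.8]])
U_explode = np.array([[1.1, 0.0], [0.0, 1.2]])

for name, U in [("vanishing", U_vanish), ("exploding", U_explode)]:
    s = s0.copy()
    for _ in range(50):              # s_t = U^t s_0 after 50 steps
        s = U @ s
    print(name, np.linalg.norm(s))
# The norm shrinks towards 0 for eigenvalues < 1 and blows up for eigenvalues > 1.
```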
-
Long Short-Term Memory
Diagram: the LSTM cell with input $x_t$, previous hidden state $h_{t-1}$, memory cell $c_t$, three sigmoid gates and two $\tanh$ nonlinearities.

Later, in Gers, Schmidhuber & Cummins (1999), a possibility to forget information from the memory cell $c_t$ was added.

$$\begin{aligned}
i_t &\leftarrow \sigma(W^i x_t + V^i h_{t-1} + b^i) \\
f_t &\leftarrow \sigma(W^f x_t + V^f h_{t-1} + b^f) \\
o_t &\leftarrow \sigma(W^o x_t + V^o h_{t-1} + b^o) \\
c_t &\leftarrow f_t \cdot c_{t-1} + i_t \cdot \tanh(W^y x_t + V^y h_{t-1} + b^y) \\
h_t &\leftarrow o_t \cdot \tanh(c_t)
\end{aligned}$$
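A minimal NumPy sketch of one LSTM step following the gate equations above; the parameter dictionary and dimensions are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, p):
    """One LSTM step. `p` holds W*, V*, b* for the input (i), forget (f),
    output (o) gates and the cell candidate (y)."""
    i = sigmoid(p["Wi"] @ x + p["Vi"] @ h_prev + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Vf"] @ h_prev + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Vo"] @ h_prev + p["bo"])
    c = f * c_prev + i * np.tanh(p["Wy"] @ x + p["Vy"] @ h_prev + p["by"])
    h = o * np.tanh(c)
    return h, c

# Illustrative parameters: input size 3, hidden size 4.
rng = np.random.default_rng(1)
p = {}
for g in "ifoy":
    p[f"W{g}"] = rng.normal(scale=0.1, size=(4, 3))
    p[f"V{g}"] = rng.normal(scale=0.1, size=(4, 4))
    p[f"b{g}"] = np.zeros(4)

h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h, c = lstm_cell(x, h, c, p)
print(h)
```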
-
Long Short-Term Memory
http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png
-
Gated Recurrent Unit
Diagram: the GRU cell with input $x_t$, previous state $h_{t-1}$, reset and update gates $r_t$, $u_t$, candidate state $\hat h_t$ and new state $h_t$.

$$\begin{aligned}
r_t &\leftarrow \sigma(W^r x_t + V^r h_{t-1} + b^r) \\
u_t &\leftarrow \sigma(W^u x_t + V^u h_{t-1} + b^u) \\
\hat h_t &\leftarrow \tanh(W^h x_t + V^h (r_t \cdot h_{t-1}) + b^h) \\
h_t &\leftarrow u_t \cdot h_{t-1} + (1 - u_t) \cdot \hat h_t
\end{aligned}$$
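The corresponding NumPy sketch of one GRU step, again with illustrative parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, p):
    """One GRU step. `p` holds W*, V*, b* for the reset (r) and update (u)
    gates and the candidate state (h)."""
    r = sigmoid(p["Wr"] @ x + p["Vr"] @ h_prev + p["br"])
    u = sigmoid(p["Wu"] @ x + p["Vu"] @ h_prev + p["bu"])
    h_hat = np.tanh(p["Wh"] @ x + p["Vh"] @ (r * h_prev) + p["bh"])
    return u * h_prev + (1 - u) * h_hat

rng = np.random.default_rng(2)
p = {}
for g in "ruh":
    p[f"W{g}"] = rng.normal(scale=0.1, size=(4, 3))
    p[f"V{g}"] = rng.normal(scale=0.1, size=(4, 4))
    p[f"b{g}"] = np.zeros(4)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = gru_cell(x, h, p)
print(h)
```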
-
Gated Recurrent Unit
http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png
-
Word Embeddings
One-hot encoding considers all words to be independent of each other.
However, words are not independent – some are more similar than others.
Ideally, we would like some kind of similarity in the space of the word representations.
Distributed Representation
The idea behind distributed representation is that objects can be represented using a set of common underlying factors.

We therefore represent words as fixed-size embeddings into the space $\mathbb{R}^d$, with the vector elements playing the role of the common underlying factors.
-
Word Embeddings
The word embedding layer is in fact just a fully connected layer on top of a one-hot encoding. However, it is important that this layer is shared across the whole network.

Diagram: words in one-hot encoding (dimension $V$) are mapped by a shared $V \times D$ matrix to $D$-dimensional embeddings used by the rest of the network.
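A tiny NumPy sketch (illustrative sizes) showing that multiplying a one-hot vector by a shared $V \times D$ matrix is the same as looking up a row of that matrix:

```python
import numpy as np

V, D = 10, 4                               # vocabulary size, embedding dimension
rng = np.random.default_rng(3)
E = rng.normal(size=(V, D))                # shared embedding matrix

word_id = 7
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# Fully connected layer on a one-hot input == row lookup.
assert np.allclose(one_hot @ E, E[word_id])
```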
-
Word Embeddings for Unknown Words
Recurrent Character-level WEs
Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.
-
Word Embeddings for Unknown Words
Convolutional Character-level WEs
Figure 1 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.
-
Basic RNN Applications
Sequence Element Classification
Use outputs for individual elements.

Diagram: an unrolled RNN producing one output per input element.

Sequence Representation
Use the state after processing the whole sequence (alternatively, take the output of the last element).
-
Structured Prediction
Consider generating a sequence $y_1, \ldots, y_N \in Y^N$ given input $x_1, \ldots, x_N$.

Predicting each sequence element independently models the distribution $P(y_i \mid X)$.

However, there may be dependencies among the $y_i$ themselves, which is difficult to capture by independent element classification.
-
Linear-Chain Conditional Random Fields (CRF)
A linear-chain Conditional Random Field, usually abbreviated only to CRF, acts as an output layer. It can be considered an extension of a softmax – instead of a sequence of independent softmaxes, a CRF is a sentence-level softmax, with additional weights for neighboring sequence elements.

$$s(X, y; \theta, A) = \sum_{i=1}^N \big(A_{y_{i-1}, y_i} + f_\theta(y_i \mid X)\big)$$

$$p(y \mid X) = \operatorname{softmax}_{z \in Y^N}\big(s(X, z)\big)_y$$

$$\log p(y \mid X) = s(X, y) - \operatorname{logadd}_{z \in Y^N}\big(s(X, z)\big)$$
-
Linear-Chain Conditional Random Fields (CRF)
Computation
We can compute $p(y \mid X)$ efficiently using dynamic programming, denoting by $\alpha_t(k)$ the probability of all sentences with $t$ elements, with the last one being $k$.

The core idea is the following:
$$\alpha_t(k) = f_\theta(y_t = k \mid X) + \operatorname{logadd}_{j \in Y}\big(\alpha_{t-1}(j) + A_{j,k}\big).$$

For an efficient implementation, we use the fact that
$$\ln(a + b) = \ln a + \ln\big(1 + e^{\ln b - \ln a}\big).$$
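A minimal NumPy sketch (illustrative, not the course reference implementation) of this forward pass, computing the log-partition over all label sequences from per-position unary scores `f` of shape `(T, K)` and a transition matrix `A` of shape `(K, K)`:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(f, A):
    """logadd over all label sequences z of s(X, z), via the alpha recursion."""
    T, K = f.shape
    alpha = f[0]                                          # alpha_1(k)
    for t in range(1, T):
        # alpha_t(k) = f(y_t = k | X) + logadd_j (alpha_{t-1}(j) + A[j, k])
        alpha = f[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha)

def crf_log_prob(y, f, A):
    """log p(y | X) = s(X, y) - log-partition.
    (The transition from a start label is omitted for simplicity.)"""
    score = f[0, y[0]] + sum(A[y[i - 1], y[i]] + f[i, y[i]] for i in range(1, len(y)))
    return score - crf_log_partition(f, A)

rng = np.random.default_rng(4)
f, A = rng.normal(size=(6, 3)), rng.normal(size=(3, 3))
print(crf_log_prob([0, 2, 1, 1, 0, 2], f, A))
```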
-
Conditional Random Fields (CRF)
Decoding
We can perform optimal decoding by using the same algorithm, only replacing $\operatorname{logadd}$ with $\max$ and tracking where the maximum was attained.

Applications
CRF output layers are useful for span labeling tasks, like named entity recognition or dialog slot filling.
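A sketch of the corresponding Viterbi decoding, using the same assumed shapes as in the previous sketch (unary scores `f` of shape `(T, K)`, transitions `A` of shape `(K, K)`); illustrative only.

```python
import numpy as np

def crf_viterbi(f, A):
    """Most probable label sequence: the same recursion with max instead of logadd."""
    T, K = f.shape
    alpha = f[0]
    backpointers = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = alpha[:, None] + A             # scores[j, k]: coming from j to k
        backpointers[t] = scores.argmax(axis=0)
        alpha = f[t] + scores.max(axis=0)
    # Follow the backpointers from the best final label.
    y = [int(alpha.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(backpointers[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(5)
print(crf_viterbi(rng.normal(size=(6, 3)), rng.normal(size=(3, 3))))
```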
-
Connectionist Temporal Classification
Let us again consider generating a sequence $y_1, \ldots, y_M$ given input $x_1, \ldots, x_N$, but this time $M \le N$ and there is no explicit alignment of $x$ and $y$ in the gold data.

Figure 7.1 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.
-
Connectionist Temporal Classification
We enlarge the set of output labels by a – (blank) and perform a classification for every input element to produce an extended labeling. We then post-process it by the following rules (denoted $\mathcal B$):

1. We remove neighboring identical symbols.
2. We remove the –.

Because the explicit alignment of inputs and labels is not known, we consider all possible alignments.

Denoting the probability of label $l$ at time $t$ as $p_l^t$, we define

$$\alpha^t(s) \stackrel{\textrm{def}}{=} \sum_{\textrm{labeling }\pi:\, \mathcal B(\pi_{1:t}) = y_{1:s}} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}.$$
-
CRF and CTC Comparison
In CRF, we normalize the whole sentences, therefore we need to compute unnormalized probabilities for all the (exponentially many) sentences. Decoding can be performed optimally.

In CTC, we normalize per each label. However, because we do not have explicit alignment, we compute the probability of a labeling by summing probabilities of (generally exponentially many) extended labelings.
-
Connectionist Temporal Classification
Computation
When aligning an extended labeling to a regular one, we need to consider whether the extended labeling ends by a blank or not. We therefore define

$$\alpha_-^t(s) \stackrel{\textrm{def}}{=} \sum_{\textrm{labeling }\pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\, \pi_t = -} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$

$$\alpha_*^t(s) \stackrel{\textrm{def}}{=} \sum_{\textrm{labeling }\pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\, \pi_t \neq -} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$

and compute $\alpha^t(s)$ as $\alpha_-^t(s) + \alpha_*^t(s)$.
-
Connectionist Temporal Classification
Figure 7.3 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

Computation
We initialize the $\alpha$s as follows:

$$\alpha_-^1(0) \leftarrow p_-^1$$
$$\alpha_*^1(1) \leftarrow p_{y_1}^1$$

We then proceed recurrently according to:

$$\alpha_-^t(s) \leftarrow p_-^t \big(\alpha_-^{t-1}(s) + \alpha_*^{t-1}(s)\big)$$

$$\alpha_*^t(s) \leftarrow \begin{cases}
p_{y_s}^t\big(\alpha_*^{t-1}(s) + \alpha_*^{t-1}(s-1) + \alpha_-^{t-1}(s-1)\big), & \textrm{if } y_s \neq y_{s-1}, \\
p_{y_s}^t\big(\alpha_*^{t-1}(s) + \alpha_-^{t-1}(s-1)\big), & \textrm{if } y_s = y_{s-1}.
\end{cases}$$
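A compact NumPy sketch of this forward computation (illustrative only; a real implementation would work in log space for numerical stability). Here `probs` is assumed to have shape `(T, num_labels + 1)` with the blank in the last column, and `y` is the target labeling as a list of label indices.

```python
import numpy as np

def ctc_forward(probs, y, blank=-1):
    """Total probability of labeling y, summing over all extended labelings."""
    T, _ = probs.shape
    S = len(y)
    # alpha_blank[s] ~ alpha_-^t(s), alpha_label[s] ~ alpha_*^t(s), s = 0..S.
    alpha_blank = np.zeros(S + 1)
    alpha_label = np.zeros(S + 1)
    alpha_blank[0] = probs[0, blank]          # alpha_-^1(0) <- p_-^1
    if S > 0:
        alpha_label[1] = probs[0, y[0]]       # alpha_*^1(1) <- p_{y_1}^1
    for t in range(1, T):
        new_blank = probs[t, blank] * (alpha_blank + alpha_label)
        new_label = np.zeros(S + 1)
        for s in range(1, S + 1):
            total = alpha_label[s] + alpha_blank[s - 1]
            # The extra term is allowed only when y_s differs from y_{s-1}
            # (alpha_label[0] is always zero, so s == 1 adds nothing).
            if s == 1 or y[s - 1] != y[s - 2]:
                total += alpha_label[s - 1]
            new_label[s] = probs[t, y[s - 1]] * total
        alpha_blank, alpha_label = new_blank, new_label
    return alpha_blank[S] + alpha_label[S]

rng = np.random.default_rng(6)
probs = rng.dirichlet(np.ones(4), size=8)     # 8 frames, 3 labels + blank
print(ctc_forward(probs, [0, 1, 1, 2]))
```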
-
CTC Decoding
Unlike CRF, we cannot perform the decoding optimally. The key observation is that while an optimal extended labeling can be extended into an optimal labeling of a larger length, the same does not apply to a regular (non-extended) labeling. The problem is that a regular labeling corresponds to many extended labelings, each of which is modified in a different way during an extension of the regular labeling.
Figure 7.5 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.
-
CTC Decoding
Beam Search
To perform beam search, we keep the $k$ best regular labelings for each prefix of the extended labelings. For each regular labeling we keep both $\alpha_-$ and $\alpha_*$, and by best we mean such regular labelings with maximum $\alpha_- + \alpha_*$.

To compute the best regular labelings for a longer prefix of extended labelings, for each regular labeling in the beam we consider the following cases:
- adding a blank symbol, i.e., updating both $\alpha_-$ and $\alpha_*$;
- adding any non-blank symbol, i.e., updating $\alpha_*$.

Finally, we merge the resulting candidates according to their regular labeling and keep only the $k$ best.
-
Unsupervised Word Embeddings
The embeddings can be trained for each task separately.
However, a method of precomputing word embeddings has been proposed, based on the distributional hypothesis:
Words that are used in the same contexts tend to have similar meanings.
The distributional hypothesis is usually attributed to Firth (1957).
-
Word2Vec
Diagram: the CBOW (Continuous Bag Of Words) architecture, where the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are projected, summed and used to predict $w_t$; and the Skip-gram architecture, where $w_t$ is used to predict the surrounding context words.

Mikolov et al. (2013) proposed two very simple architectures for precomputing word embeddings, together with a C multi-threaded implementation word2vec.
-
Word2Vec
Table 8 of paper "Efficient Estimation of Word Representations in Vector Space", https://arxiv.org/abs/1301.3781.
-
Word2Vec – SkipGram Model
Diagram: the CBOW and Skip-gram architectures, as on the previous Word2Vec slide.

Considering input word $w_i$ and output $w_o$, the Skip-gram model defines

$$p(w_o \mid w_i) \stackrel{\textrm{def}}{=} \frac{e^{W_{w_o}^\top V_{w_i}}}{\sum_{w} e^{W_{w}^\top V_{w_i}}}.$$
-
Word2Vec – Hierarchical Softmax
Instead of a large softmax, we construct a binary tree over the words, with a sigmoid classifier for each node.

If word $w$ corresponds to a path $n_1, n_2, \ldots, n_L$, we define

$$p_{\textrm{HS}}(w \mid w_i) \stackrel{\textrm{def}}{=} \prod_{j=1}^{L-1} \sigma\big([\textrm{+1 if } n_{j+1} \textrm{ is right child else -1}] \cdot W_{n_j}^\top V_{w_i}\big).$$
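A minimal NumPy sketch of this probability, assuming the word's root-to-leaf path is given as a list of (inner-node id, is-right-child) pairs; the tree construction itself (e.g. a Huffman tree) is not shown, and all names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_hierarchical_softmax(path, v_wi, W_nodes):
    """p_HS(w | w_i) for a word whose path is a list of (node_id, is_right_child)."""
    p = 1.0
    for node, is_right in path:
        sign = +1.0 if is_right else -1.0
        p *= sigmoid(sign * W_nodes[node] @ v_wi)
    return p

rng = np.random.default_rng(7)
W_nodes = rng.normal(size=(15, 8))         # one weight vector per inner node
v_wi = rng.normal(size=8)                  # embedding of the input word
print(p_hierarchical_softmax([(0, True), (2, False), (6, True)], v_wi, W_nodes))
```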
-
Word2Vec – Negative Sampling
Instead of a large softmax, we could train individual sigmoids for all words.
We could also only sample the negative examples instead of training all of them.
This gives rise to the following negative sampling objective:

$$l_{\textrm{NEG}}(w_o, w_i) \stackrel{\textrm{def}}{=} \log \sigma\big(W_{w_o}^\top V_{w_i}\big) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P(w)} \log\Big(1 - \sigma\big(W_{w_j}^\top V_{w_i}\big)\Big).$$

For $P(w)$, both the uniform and the unigram distribution $U(w)$ work, but $U(w)^{3/4}$ outperforms them significantly (this fact has been reported in several papers by different authors).
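An illustrative NumPy sketch of the loss for one negative-sampling training example; the vocabulary, dimensions and the sampling distribution are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
vocab_size, dim, k = 1000, 64, 5
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # input ("center") embeddings
W = rng.normal(scale=0.1, size=(vocab_size, dim))   # output ("context") embeddings

# Unigram distribution raised to 3/4 and renormalized.
counts = rng.integers(1, 100, size=vocab_size)
P = counts ** 0.75
P = P / P.sum()

w_i, w_o = 42, 7                                    # a (center, context) pair
negatives = rng.choice(vocab_size, size=k, p=P)     # k sampled negative words

# Negative of l_NEG (a loss to be minimized).
loss = -(np.log(sigmoid(W[w_o] @ V[w_i]))
         + np.sum(np.log(1 - sigmoid(W[negatives] @ V[w_i]))))
print(loss)
```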
-
Recurrent Character-level WEs
Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.
-
Convolutional Character-level WEs
Table 6 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.
-
Character N-grams
Another simple idea appeared in three nearly simultaneous publications: Charagram, Subword Information and SubGram.

A word embedding is the sum of the embedding of the word itself plus the embeddings of its character n-grams. Such embeddings can be pretrained using the same algorithms as word2vec.

The implementation can be
- dictionary based: only some number of frequent character n-grams is kept;
- hash-based: character n-grams are hashed into $K$ buckets (usually $K \sim 10^6$ is used).
Charagram: https://arxiv.org/abs/1607.02789; Subword Information: https://arxiv.org/abs/1607.04606; SubGram: http://link.springer.com/chapter/10.1007/978-3-319-45510-5_21
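A small Python sketch of the hash-based variant: the word vector is the sum of the word's own vector and the vectors of its hashed character n-grams. The bucket count and the hash function (Python's built-in `hash`) are illustrative, not the ones used by the cited implementations.

```python
import numpy as np

K, dim = 2 ** 20, 64                        # number of hash buckets, embedding size
rng = np.random.default_rng(9)
word_vectors = {"where": rng.normal(scale=0.1, size=dim)}
ngram_vectors = rng.normal(scale=0.1, size=(K, dim))

def char_ngrams(word, n_min=3, n_max=6):
    padded = "<" + word + ">"               # mark word boundaries
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def embed(word):
    vec = word_vectors.get(word, np.zeros(dim)).copy()
    for ng in char_ngrams(word):
        vec += ngram_vectors[hash(ng) % K]  # hashing trick: n-gram -> bucket
    return vec

print(np.linalg.norm(embed("where")))
```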
-
Charagram WEs
Table 7 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.
-
Charagram WEs
Figure 2 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.
-
Sequence-to-Sequence Architecture
-
Sequence-to-Sequence Architecture
Figure 1 of paper "Sequence to Sequence Learning with Neural Networks", https://arxiv.org/abs/1409.3215.
-
Sequence-to-Sequence Architecture
Figure 1 of paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", https://arxiv.org/abs/1406.1078.
-
Sequence-to-Sequence Architecture
Diagram: an encoder-decoder model; the decoder produces $\hat x^{(0)}, \hat x^{(1)}, \ldots$ until EOS. During training the gold outputs $x^{(0)}, x^{(1)}, \ldots$ (prefixed by BOS) are fed as decoder inputs, while during inference the decoder's own predictions are.

Training
The so-called teacher forcing is used during training – the gold outputs are used as inputs during training.

Inference
During inference, the network processes its own predictions.

Usually, the generated logits are processed by an $\arg\max$, the chosen word embedded and used as next input.
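A schematic greedy-decoding loop; `decoder_step`, `embed` and the token ids are hypothetical placeholders standing in for whatever decoder cell and embedding the model actually uses.

```python
import numpy as np

def greedy_decode(decoder_step, embed, state, bos_id, eos_id, max_len=50):
    """Feed the decoder its own argmax predictions until EOS (inference mode)."""
    token, output = bos_id, []
    for _ in range(max_len):
        logits, state = decoder_step(embed(token), state)  # one decoder step
        token = int(np.argmax(logits))                     # choose the best word
        if token == eos_id:
            break
        output.append(token)
    return output
```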
-
Tying Word Embeddings
Diagram: in the decoder, the target word id is embedded by a $V \times D$ matrix, processed by the RNN, and the output layer producing the target word logits is a $D \times V$ matrix – these two matrices can be tied (shared).
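A minimal sketch of weight tying with illustrative shapes: the output projection simply reuses the transpose of the embedding matrix (the toy `tanh` stands in for the RNN).

```python
import numpy as np

V, D = 1000, 64
rng = np.random.default_rng(10)
E = rng.normal(scale=0.1, size=(V, D))      # target word embedding matrix (V x D)

def embed(word_id):
    return E[word_id]                        # D-dimensional embedding

def output_logits(rnn_output):
    # Tied output layer: the D x V projection is E transposed.
    return rnn_output @ E.T                  # V-dimensional logits

logits = output_logits(np.tanh(embed(7)))    # toy "RNN" = tanh, for illustration
print(logits.shape)                          # (1000,)
```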
-
Attention
Figure 1 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.
As another input during decoding, we add the context vector $c_i$:

$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$

We compute the context vector as a weighted combination of source sentence encoded outputs:

$$c_i = \sum_j \alpha_{ij} h_j.$$

The weights $\alpha_{ij}$ are a softmax of $e_{ij}$ over $j$,

$$\alpha_i = \operatorname{softmax}(e_i),$$

with $e_{ij}$ being

$$e_{ij} = v^\top \tanh(V h_j + W s_{i-1} + b).$$
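A compact NumPy sketch of one attention step with illustrative dimensions (encoder outputs `H` of shape `(N, dim)`, previous decoder state `s_prev`):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(H, s_prev, V, W, b, v):
    """Bahdanau-style attention: scores, softmax weights, weighted context."""
    e = np.array([v @ np.tanh(V @ h_j + W @ s_prev + b) for h_j in H])
    alpha = softmax(e)                  # attention weights over source positions
    return alpha @ H, alpha             # context vector c_i and the weights

rng = np.random.default_rng(11)
N, dim, att = 6, 8, 5
H = rng.normal(size=(N, dim))           # encoder outputs h_j
s_prev = rng.normal(size=dim)           # previous decoder state s_{i-1}
V, W = rng.normal(size=(att, dim)), rng.normal(size=(att, dim))
b, v = np.zeros(att), rng.normal(size=att)
c, alpha = attention_context(H, s_prev, V, W, b, v)
print(alpha.round(3), c.shape)
```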
-
Attention
Figure 3 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.
-
Subword Units
Translate subword units instead of words. The subword units can be generated in several ways; the most commonly used are:

BPE – Using the byte pair encoding algorithm. Start with characters plus a special end-of-word symbol $\cdot$. Then, iteratively merge the most frequently occurring symbol pair $A, B$ into a new symbol $AB$, with the symbol pair never crossing a word boundary.

Considering a dictionary with words low, lowest, newer, wider:
r, $\cdot$ → r$\cdot$
l, o → lo
lo, w → low
e, r$\cdot$ → er$\cdot$

Wordpieces – Joining neighboring symbols to maximize the unigram language-model likelihood.

Usually quite a small number of subword units is used (32k–64k), often generated on the union of the two vocabularies (the so-called joint BPE or shared wordpieces).
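A small Python sketch of learning BPE merges on that toy dictionary (not the reference implementation; all word frequencies are set to 1 here, which is a simplification, so the merge order may differ from the example above):

```python
import collections, re

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges; '·' marks the end of a word."""
    # Each word is a space-separated sequence of current symbols.
    vocab = {" ".join(list(word) + ["·"]): 1 for word in words}
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):   # pairs never cross word boundary
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent pair
        merges.append((a, b))
        # Replace the pair only where both symbols stand alone.
        pattern = re.compile(r"(?<!\S)" + re.escape(a + " " + b) + r"(?!\S)")
        vocab = {pattern.sub(a + b, word): freq for word, freq in vocab.items()}
    return merges

print(learn_bpe(["low", "lowest", "newer", "wider"], 4))
```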
-
Google NMT
Figure 1 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.
-
Google NMT
Figure 5 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.
-
Google NMT
Figure 6 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.
-
Beyond one Language Pair
Figure 5 of "Show and Tell: Lessons learned from the 2015 MSCOCO...", https://arxiv.org/abs/1609.06647.
-
Beyond one Language Pair
Figure 6 of "Multimodal Compact Bilinear Pooling for VQA and Visual Grounding", https://arxiv.org/abs/1606.01847.
-
Multilingual Translation
Many attempts at multilingual translation.
Individual encoders and decoders, shared attention.
Shared encoders and decoders.