On Using Monolingual Corpora in Neural Machine Translation
& Towards better decoding and integration of language
models in sequence to sequence models
Rebekka Hubert
Department of Computational Linguistics (ICL), University of Heidelberg
1 Monolingual Corpora in NMT
2 On language models in end-to-end systems
3 Conclusions
Motivation
NMT systems suffer from low resources and domain restriction
monolingual corpora exhibit linguistic structure
integrate a language model trained on monolingual corpora into an NMT model suffering from low resources or domain restriction
Model - based on [Bahdanau et al., 2014]
encoder-decoder framework
Figure 1: overview of the model [Bahdanau et al., 2014]
Encoder
bidirectional RNN
sequence of annotation vectors is the concatenation of pairs of hidden states:
$h_j^\top = [\overleftarrow{h}_j^\top; \overrightarrow{h}_j^\top]$
$h_j$ encodes information about the $j$-th word w.r.t. all of the surrounding words in the sentence
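As an illustration (not the authors' code), a minimal numpy sketch of how the annotation vectors arise; the tanh-RNN cell and all dimensions are illustrative assumptions standing in for the gated units of [Bahdanau et al., 2014]:

```python
import numpy as np

def rnn_pass(X, W, U, b):
    """Run a simple tanh RNN over X (T x d_in); return all hidden states (T x d_h)."""
    h = np.zeros(U.shape[0])
    states = []
    for t in range(X.shape[0]):
        h = np.tanh(X[t] @ W + h @ U + b)
        states.append(h)
    return np.stack(states)

def bidirectional_annotations(X, params_fwd, params_bwd):
    """Annotation h_j = [forward state; backward state] for each source position j."""
    h_fwd = rnn_pass(X, *params_fwd)                 # left-to-right
    h_bwd = rnn_pass(X[::-1], *params_bwd)[::-1]     # right-to-left, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)    # (T x 2*d_h)

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 5
make = lambda: (rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
H = bidirectional_annotations(rng.normal(size=(T, d_in)), make(), make())
print(H.shape)  # (5, 32): one annotation vector per source word
```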
Decoder
emulates searching through a source sentence during decoding
at each timestep t
compute $s_t = f_r(s_{t-1}, y_{t-1}, c_t)$
compute context vector $c_t$ (expected annotation):
$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$$
with
$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}; \qquad e_{tj} = a(s_{t-1}, h_j, y_{t-1})$$
$a$: soft-alignment model (feedforward neural network); see the sketch below
$\alpha$: attention weights
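A minimal numpy sketch of the soft-alignment and context-vector computation (illustrative only; the conditioning on $y_{t-1}$ is omitted for brevity, and the MLP weights Wa, Ua, va are hypothetical names):

```python
import numpy as np

def soft_alignment(s_prev, H, Wa, Ua, va):
    """Soft-alignment model a(s_{t-1}, h_j): an MLP scores each annotation,
    then a softmax over source positions yields the attention weights alpha_tj."""
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va   # scores e_tj, one per source position
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()               # softmax over source positions

def context_vector(alpha, H):
    """Expected annotation c_t = sum_j alpha_tj * h_j."""
    return alpha @ H

rng = np.random.default_rng(1)
d_s, d_h, Tx = 16, 32, 5
H = rng.normal(size=(Tx, d_h))               # annotation vectors from the encoder
s_prev = rng.normal(size=d_s)
alpha = soft_alignment(s_prev, H,
                       rng.normal(size=(d_s, 10)), rng.normal(size=(d_h, 10)),
                       rng.normal(size=10))
print(alpha.sum(), context_vector(alpha, H).shape)   # 1.0, (32,)
```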
Decoder
compute probability of target word $y_t$ with a deep output layer:
$$p(y_t \mid y_{<t}, x) \propto \exp\big(y_t^\top (W_o f_o(s_t^{TM}, y_{t-1}, c_t) + b_o)\big)$$
$y_t$: $K_y$-dimensional, one-hot encoded vector indicating one of the words in the target vocabulary
for large target vocabularies: importance sampling technique [Jean et al., 2014]
$W_o \in \mathbb{R}^{K_y \times l}$: weight matrix
$f_o$: neural network with two-way maxout non-linearity
$b_o$: bias
Sampling technique
partition the training corpus and define a small subset $V'$ of unique target words for each partition
at each update, only the vectors associated with the sampled words in $V'$ and the correct word are updated
equivalent to
$$p(y_t \mid y_{<t}, x) = \frac{\exp\{y_t^\top \phi(y_{t-1}, s_t, c_t) + b_t\}}{\sum_{y_k \in V'} \exp\{y_k^\top \phi(y_{t-1}, s_t, c_t) + b_k\}}$$
approximates the exact output probability (see the sketch below)
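A minimal numpy sketch of this sampled normalization (illustrative; `logits` stands for $y^\top \phi(\cdot) + b$ over the full vocabulary, and the subset is drawn randomly here rather than per training partition):

```python
import numpy as np

def sampled_softmax(logits, target_idx, sample_idx):
    """Approximate p(y_t = target | ...) by normalizing only over the target
    word and a small sampled subset V' instead of the full vocabulary."""
    idx = np.unique(np.append(sample_idx, target_idx))  # V' plus the correct word
    z = logits[idx] - logits[idx].max()                 # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[np.searchsorted(idx, target_idx)]

rng = np.random.default_rng(2)
vocab = 50_000
logits = rng.normal(size=vocab)                     # scores for every target word
target = 123
subset = rng.choice(vocab, size=500, replace=False)  # sampled subset V'
print(sampled_softmax(logits, target, subset))       # approximate output probability
```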
Maxout Networks
characterized by the Maxout activation function
Generalization of ReLU: $h_i(x) = \max_{j \in [1,k]} z_{ij}$
$z_{ij} = x^\top W_{\cdot ij} + b_{ij}$; $W \in \mathbb{R}^{d \times m \times k}$
piecewise linear approximation to an arbitrary convex function
locally linear almost everywhere
no sparse representation can be produced
many learned parameters - best used in combination with dropout
Figure 2: maxout activation function implementing/approximating different functions [Goodfellow et al., 2013]
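A minimal numpy sketch of the maxout activation (illustrative shapes; not the authors' code):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout: h_i(x) = max_j z_ij with z_ij = x^T W_.ij + b_ij;
    W has shape (d, m, k): d inputs, m output units, k linear pieces each."""
    z = np.einsum('d,dmk->mk', x, W) + b   # all k linear responses per unit
    return z.max(axis=1)                   # pointwise max over the k pieces

rng = np.random.default_rng(3)
d, m, k = 8, 4, 2                          # k=2 gives the "two-way" maxout used in f_o
x = rng.normal(size=d)
print(maxout(x, rng.normal(size=(d, m, k)), rng.normal(size=(m, k))).shape)  # (4,)
```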
Training
train the NMT model with a bilingual corpus of pairs $(x^{(n)}, y^{(n)})$:
$$\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y^{(n)} \mid x^{(n)})$$
$\theta$: set of trainable parameters
Model setup
RNNLM as language model: set $c_t = 0$
NMT as described above
pre-train RNNLM and NMT separately
Integrating the language model - Fusions
Figure 3: Shallow and Deep Fusion [Gulcehre et al., 2015]
Shallow Fusion
at each timestep t
NMT: compute scores of possible next words for all hypotheses $\{y_{\le t-1}^{(i)}\}$
sort new hypotheses (old hypotheses with the new word added)
select top $K$ hypotheses as candidates $\{y_{\le t}^{(i)}\}_{i=1,\dots,K}$
rescore hypotheses with a weighted sum of TM and LM scores for the new word (see the sketch below):
$$\log p(y_t = k) = \log p_{TM}(y_t = k) + \beta \log p_{LM}(y_t = k)$$
β: hyper-parameter
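A minimal numpy sketch of the Shallow Fusion rescoring rule (illustrative; the two distributions are random stand-ins for the NMT and RNNLM next-word outputs):

```python
import numpy as np

def shallow_fusion_scores(log_p_tm, log_p_lm, beta):
    """Combined score log p(y_t = k) = log p_TM + beta * log p_LM for every
    candidate next word k; used to rerank the beam's top-K hypotheses."""
    return log_p_tm + beta * log_p_lm

rng = np.random.default_rng(4)
vocab = 10
log_p_tm = np.log(rng.dirichlet(np.ones(vocab)))   # NMT next-word log-probs
log_p_lm = np.log(rng.dirichlet(np.ones(vocab)))   # RNNLM next-word log-probs
scores = shallow_fusion_scores(log_p_tm, log_p_lm, beta=0.2)
print(np.argsort(scores)[::-1][:3])                # top-3 candidate words
```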
Deep Fusion
integrate RNNLM and the decoder of the NMT model by concatenating their hidden states
fine-tune the model to use both hidden states by tuning only the output parameters
new input of the deep output layer:
$$p(y_t \mid y_{<t}, x) \propto \exp\big(y_t^\top (W_o f_o(s_t^{TM}, s_t^{LM}, y_{t-1}, c_t) + b_o)\big)$$
keeps the learned monolingual structure of the LM intact
Deep Fusion - Balancing LM and TM
use a controller mechanism to dynamically weight the LM and TM models:
$$g_t = \sigma(v_g^\top s_t^{LM} + b_g)$$
multiply the controller output with the hidden state of the LM to control the magnitude of the LM signal
decoder only regards LM when deemed necessary
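A minimal numpy sketch of the controller (illustrative; the final concatenation mirrors the Deep Fusion output layer above):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gated_lm_state(s_lm, v_g, b_g):
    """Controller g_t = sigma(v_g^T s_t^LM + b_g) scales the LM hidden state
    before it is concatenated with the decoder state for the output layer."""
    g = sigmoid(v_g @ s_lm + b_g)
    return g, g * s_lm

rng = np.random.default_rng(5)
d_lm = 16
s_lm = rng.normal(size=d_lm)                  # RNNLM hidden state
g, scaled = gated_lm_state(s_lm, rng.normal(size=d_lm), 0.0)
s_tm = rng.normal(size=32)                    # NMT decoder hidden state
fused = np.concatenate([s_tm, scaled])        # input to the deep output layer f_o
print(g, fused.shape)                         # scalar gate, (48,)
```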
Corpora
parallel datasets
Table 1: datasets used [Gulcehre et al., 2015]
Zh-En: training on SMS/CHAT, test on conversational speech
all others: close domains
monolingual dataset: English gigaword corpus
Results - Translation Tasks
(a) Tr-En for each set
(b) Zh-En, PB: phrase-based, HPB: hierarchical phrase-based
(c) De-En and Cs-En translation tasks
Table 2: translation results on parallel corpora [Gulcehre et al., 2015]
Results - RNNLM perplexity on development set
Table 3: Perplexity of the RNNLM on the dev. set and controller output g [Gulcehre et al., 2015]
Conclusions
Fusion
generally improves performance, regardless of the size of the parallel corpora
Deep Fusion achieves bigger improvements than Shallow Fusion
Deep Fusion makes a model incorporate the LM information as needed
domain similarity
divergence in domains generally hurts model performance
LM is most useful when the domains of the parallel and monolingual corpora are close
controller mechanism
makes the model more robust to domain mismatch
most active on De-En/Cs-En, as the LM was able to model the target sentences best here
Motivation - [Chorowski and Jaitly, 2016]
DNN Automatic Speech Recognition (ASR) systems suffer from overconfidence
peaked probability distribution prevents the model from finding sensible alternatives
harms learning of deep layers, as the gradient approaches 0 on correct characters: $[y_i = c] - p(y_i \mid y_{<i}, x)$
integrating an LM to combat this results in incomplete transcripts
idea: better integrate language models
Model - based on [Chan et al., 2015]
Figure 4: Listen, Attend and Spell model [Chan et al., 2015]
Model
maps speech directly to a space-delimited sequence of characters using the encoder-decoder structure
Listener: pyramid BLSTM (pBLSTM)
$h_t^j = \mathrm{pBLSTM}\big(h_{t-1}^j, [h_{2t}^{j-1}, h_{2t+1}^{j-1}]\big)$
Speller
attention-based LSTM transducer
$c_t = \mathrm{AttentionContext}(s_t, h)$
$s_t = \mathrm{RNN}(s_{t-1}, y_{t-1}, c_{t-1})$
$P(y_t \mid x, y_{<t}) = \mathrm{CharacterDistribution}(s_t, c_t)$
RNN: 2-layer LSTM
CharacterDistribution: MLP with softmax
beam search to find the character sequence $y^*$
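A minimal numpy sketch of the pyramid time-reduction in the Listener (illustrative; the BLSTM that would run over each reduced sequence is elided):

```python
import numpy as np

def pyramid_reduce(H):
    """One pBLSTM pyramid step: concatenate each pair of consecutive frames
    [h_{2t}, h_{2t+1}], halving the time resolution before the next BLSTM layer."""
    T, d = H.shape
    H = H[: T - T % 2]               # drop a trailing odd frame
    return H.reshape(-1, 2 * d)      # (T/2, 2d)

rng = np.random.default_rng(6)
H = rng.normal(size=(80, 64))        # outputs of the previous (B)LSTM layer
for layer in range(3):               # 3 pyramid layers: 8x time reduction
    H = pyramid_reduce(H)            # a BLSTM over H would follow here
print(H.shape)                       # (10, 512)
```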
Model - AttentionContext for decoder timestep t
$$c_t = \sum_u \alpha_{t,u} h_u$$
$$\alpha_{t,u} = \frac{\exp(e_{t,u})}{\sum_{u'} \exp(e_{t,u'})}$$
$$e_{t,u} = w^\top \tanh(W s_{t-1} + V h_u + U f_{t,u} + b)$$
$$f_t = F * \alpha_{t-1}$$
$\alpha$: a very sharp probability distribution (attention weights)
$e_{t,u}$: location-aware scoring mechanism [Chorowski et al., 2015]
employs convolutional filters over the previous attention weights: extracts a vector $f_{t,u} \in \mathbb{R}^k$ with filters $F \in \mathbb{R}^{k \times r}$ for every position $u$ of $\alpha_{t-1}$ (see the sketch below)
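A minimal numpy sketch of location-aware scoring (illustrative weight shapes; the convolution is written as an explicit sliding window):

```python
import numpy as np

def location_aware_scores(s_prev, H, alpha_prev, W, V, U, F, w, b):
    """e_{t,u} = w^T tanh(W s_{t-1} + V h_u + U f_{t,u} + b), where the
    location features f_t = F * alpha_{t-1} come from convolving k filters
    of width r over the previous attention weights."""
    k, r = F.shape
    a = np.pad(alpha_prev, r // 2)
    # f[u] in R^k: filter responses centered on position u of alpha_{t-1}
    f = np.stack([F @ a[u:u + r] for u in range(len(alpha_prev))])
    e = np.tanh(s_prev @ W + H @ V + f @ U + b) @ w
    p = np.exp(e - e.max())
    return p / p.sum()                              # alpha_t

rng = np.random.default_rng(7)
Tu, d_h, d_s, k, r, d_a = 20, 32, 24, 4, 5, 16
alpha = location_aware_scores(
    rng.normal(size=d_s), rng.normal(size=(Tu, d_h)),
    np.full(Tu, 1.0 / Tu),                          # previous attention weights
    rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)),
    rng.normal(size=(k, d_a)), rng.normal(size=(k, r)),
    rng.normal(size=d_a), np.zeros(d_a))
print(alpha.shape, round(alpha.sum(), 3))           # (20,) 1.0
```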
Training
minimize cross-entropy between ground-truth characters and model predictions
criterion
$$\mathrm{loss}(y, x) = -\sum_i \sum_c T(y_i, c) \log p(y_i \mid y_{<i}, x)$$
T (yi , c): target-label function, implemented differently for each model
Language Model Integration - based on Shallow Fusion [Gulcehre et al., 2015]
extends beam search with a language modelling term
$$y^* = \arg\min_y \; -\log p(y \mid x) - \lambda \log p_{LM}(y) - \gamma \cdot \mathrm{coverage}$$
$\lambda, \gamma$: tunable parameters
coverage: term promoting longer transcripts to avoid incomplete transcripts
overconfidence drastically inflates $-\log p(y \mid x)$ whenever a hypothesis deviates from the network's best guess
using a non-heuristic measure instead does not change the results, yet is far more difficult to compute
Experimental Setup
dataset
Wall Street Journal dataset
extracted and augmented 80-dimensional mel-scale filterbanks
vocabulary consists of lower-case letters, space, apostrophe, noise marker, SOS, EOS
LM
extended-vocabulary trigram LM constructed with Kaldi [Povey et al., 2011] and a spelling lexicon [Miao et al., 2015]
baseline
Listener: 4 BLSTM, 3 time-pooling layers
Speller: LSTM, character embeddings, attention MLP
beam size: 10
final model: baseline + LM
Improvement tests - Overconfidence
problem
good prediction of the speller results in little training signal through the attention mechanism to the listener
the next predicted character may have low accuracy
beam search's usefulness is greatly reduced
Solution 1: Temperature Parameter T
$$p(y_i) = \frac{\exp(l_i / T)}{\sum_j \exp(l_j / T)}$$
increase $T$: more uniform distribution; more deletion errors; better beam search results (see the sketch below)
constrain emission of EOS
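A minimal numpy sketch of the temperature-scaled softmax (illustrative logits):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """p(y_i) = exp(l_i / T) / sum_j exp(l_j / T); T > 1 flattens the
    distribution, combating the network's overconfidence during decoding."""
    z = logits / T
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([5.0, 1.0, 0.5, 0.2])
for T in (1.0, 2.0, 5.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# larger T: probability mass spreads to alternatives the beam can explore
```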
Improvement tests - Overconfidence: Label Smoothing
variants of label smoothing
unigram label smoothing
neighborhood smoothing
model is regularized
higher entropy of network predictions
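A minimal numpy sketch of a smoothed target distribution $T(y_i, c)$ (illustrative: the unigram variant is approximated with uniform weights here, and `neighborhood` stands for the symbols at adjacent transcript positions):

```python
import numpy as np

def smoothed_targets(target_idx, vocab_size, eps=0.1, neighborhood=None):
    """Target distribution T(y_i, c): put 1-eps on the correct symbol and
    spread eps either over the whole vocabulary (unigram-style, uniform here)
    or over the symbols at neighboring transcript positions."""
    t = np.zeros(vocab_size)
    t[target_idx] = 1.0 - eps
    if neighborhood is None:
        t += eps / vocab_size                        # uniform smoothing
    else:
        t[list(neighborhood)] += eps / len(neighborhood)
    return t

print(smoothed_targets(2, 6))                        # mass spread over all symbols
print(smoothed_targets(2, 6, neighborhood={1, 3}))   # mass on the neighbors
```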
Improvement tests - Overconfidence
Figure 5: influence of the temperature parameter on the softmax [Chorowski and Jaitly, 2016]
Improvement tests - Partial Transcripts Problem
problem: using LM + beam search results in incomplete transcripts
solution: extend the beam search criterion with a coverage term to promote long transcripts
$$\mathrm{coverage} = \sum_j \Big[ \sum_i \alpha_{ij} > \tau \Big]$$
prevents looping over the utterance
Figure 6: impact of techniques to promote long sequences [Chorowski and Jaitly, 2016]
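A minimal numpy sketch of the coverage term (illustrative attention matrix):

```python
import numpy as np

def coverage(attn, tau=0.5):
    """coverage = sum_j [ sum_i alpha_ij > tau ]: the number of input frames j
    that have received at least tau total attention over all output steps i."""
    return int(np.sum(attn.sum(axis=0) > tau))

rng = np.random.default_rng(8)
steps, frames = 6, 12
attn = rng.dirichlet(np.ones(frames), size=steps)   # alpha_ij, rows sum to 1
print(coverage(attn))                               # frames judged "covered"
# beam score: -log p(y|x) - lambda * log p_LM(y) - gamma * coverage(attn)
```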
Results
(a) baseline configuration and results (b) trigram language model
Table 4: results on WSJ [Chorowski and Jaitly, 2016]
competitive with other non-HMM techniques at the time
Conclusion
integrating an LM into sequence-to-sequence models is competitive with other approaches, if done successfully
integrating an LM into an NN has proven to be difficult
the LM is especially useful in cases where the NN itself is uncertain
an ablation study to see the actual effect of the LM on the results is advisable
Critique and Discussion
References I
[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014).
Neural machine translation by jointly learning to align and translate.
arXiv:1409.0473. Accepted at ICLR 2015 as oral presentation.
[Chan et al., 2015] Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015).
Listen, attend and spell.
CoRR, abs/1508.01211.
[Chorowski et al., 2015] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015).
Attention-based models for speech recognition.
CoRR, abs/1506.07503.
References II
[Chorowski and Jaitly, 2016] Chorowski, J. and Jaitly, N. (2016).
Towards better decoding and language model integration in sequence to sequence models.
CoRR, abs/1612.02695.
[Goodfellow et al., 2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013).
Maxout networks.
[Gulcehre et al., 2015] Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015).
On using monolingual corpora in neural machine translation.
CoRR, abs/1503.03535.
[Jean et al., 2014] Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014).
On using very large target vocabulary for neural machine translation.
CoRR, abs/1412.2007.
References III
[Miao et al., 2015] Miao, Y., Gowayyed, M., and Metze, F. (2015).
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding.
CoRR, abs/1507.08240.
[Povey et al., 2011] Povey, D., Ghoshal, A., Boulianne, G., Goel, N., Hannemann, M., Qian, Y., Schwarz, P., and Stemmer, G. (2011).
The kaldi speech recognition toolkit.
In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.