Deep Learning for Machine Translation
Uploaded by matiss-rikters
![Page 1: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/1.jpg)
![Page 2: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/2.jpg)
What makes learning deep?
![Page 3: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/3.jpg)
![Page 4: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/4.jpg)
Schedule
• Kevin Duh - Fundamentals for DL4MT I & II
  • Lab 1 - Prep and setup. Compare logistic regression, MLP, and stacked auto-encoders on data
  • Lab 2 - Word embedding (SGNS), visualization
• Hermann Ney - Neural LMs and TMs for SMT I & II
  • Lab 1 - Rescore n-best lists using RNN LM
  • Lab 2 - n-best rescoring using uni/bidirectional translation and joint models
• Kyunghyun Cho - Neural MT I & II
  • Lab 1 - Data preparation: basic preprocessing; Encoder-Decoder with Theano (without attention)
  • Lab 2 - Attention-based Encoder-Decoder with Theano
![Page 5: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/5.jpg)
Neural Networks (NN) – brief introduction
• An NN consists of:
  • Multiple layers (an input layer, zero or more hidden layers, and an output layer), each consisting of a set of neurons (xi, hj, and y)
  • Interconnections between nodes of different layers that have weights assigned (wij and wj)
  • Activation functions for each neuron that convert the weighted input of the neuron into an output value
• Deep neural networks – NNs with more than one hidden layer
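A minimal sketch of the structure described above, assuming a single hidden layer, sigmoid activations, and NumPy. The layer sizes and random weights are illustrative only and are not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_ij, w_j):
    """Inputs x_i -> hidden neurons h_j -> single output y."""
    h = sigmoid(x @ w_ij)   # weighted input of each hidden neuron, then activation
    y = sigmoid(h @ w_j)    # weighted input of the output neuron, then activation
    return h, y

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])          # three input neurons (assumed size)
w_ij = rng.normal(0.0, 0.1, (3, 4))    # input-to-hidden weights w_ij
w_j = rng.normal(0.0, 0.1, 4)          # hidden-to-output weights w_j
h, y = forward(x, w_ij, w_j)
```

Adding further hidden layers would only repeat the `sigmoid(previous @ weights)` step.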
![Page 6: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/6.jpg)
Activation functions
• Many different types
• Most common functions in NLP – logistic sigmoid and hyperbolic tangent (because of their non-linearity)
http://blog.sciencenet.cn/blog-457187-878461.html
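For reference, the two functions named above can be written in a few lines of Python using only the standard library. The function name `logistic` is chosen here for illustration.

```python
import math

def logistic(z):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hyperbolic tangent squashes into (-1, 1); it is a rescaled, shifted sigmoid:
# tanh(z) = 2 * logistic(2z) - 1
z = 0.7
rescaled = 2.0 * logistic(2.0 * z) - 1.0
```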
![Page 7: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/7.jpg)
Training
• Neural networks are trained with the backpropagation algorithm
• Aim – reduce the cost/error of the model by iteratively performing a forward pass, calculating the error, and adjusting the weights based on the “direction” (read: derivative) of the error. The error is “backpropagated” from the output layer back to the input layer
• A long description with a lot of theory: http://neuralnetworksanddeeplearning.com/chap2.html
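The loop described above (forward pass, error, weight adjustment) can be sketched as follows. This is a toy illustration under several assumptions: one hidden layer without bias terms, a squared-error cost, full-batch gradient descent, and a made-up data set (logical OR).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [1.]])   # toy targets (logical OR)
W1 = rng.normal(0.0, 0.5, (2, 3))        # input -> hidden weights
W2 = rng.normal(0.0, 0.5, (3, 1))        # hidden -> output weights
lr = 0.5                                 # learning rate (assumed value)
losses = []
for step in range(500):
    # forward pass
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)
    # calculate the error (half mean squared error)
    losses.append(float(np.mean((y - t) ** 2) / 2))
    # backward pass: propagate the error derivative from output to input
    d_y = (y - t) * y * (1 - y) / len(X)
    d_h = (d_y @ W2.T) * h * (1 - h)
    # adjust weights against the direction of the derivative
    W2 -= lr * (h.T @ d_y)
    W1 -= lr * (X.T @ d_h)
```

The recorded `losses` should shrink over the iterations, which is the "reduce cost/error" aim stated above.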
![Page 8: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/8.jpg)
Word Embeddings
• A continuous space representation of words
  • Multi-dimensional vectors
  • Decimal values
• By-product of neural networks where words are vectorised
  • Skip-gram models
  • Neural network language models
  • Neural machine translation models
  • etc.
![Page 9: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/9.jpg)
Word Embeddings – Skip-Gram model
• Input – a word (one-of-N, i.e. one-hot, vector)
• Output – the context of the word
• Embeddings – the trained weight matrix W between the input layer and the hidden layer
  • Each row represents the embedding of a single word
• Implementation: word2vec
http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
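A small sketch of why each row of W is the embedding of one word: multiplying a one-of-N input vector by W simply selects the matching row. The vocabulary and the values in W below are placeholders, not trained weights.

```python
import numpy as np

vocab = ["the", "cat", "sat"]                  # toy vocabulary, N = 3
W = np.arange(12, dtype=float).reshape(3, 4)   # stand-in for a trained 3 x 4 weight matrix

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

idx = vocab.index("cat")
hidden = one_hot(idx, len(vocab)) @ W          # hidden-layer values for the input word
embedding = W[idx]                             # the same values, read directly as a row
```

In practice the matrix product is skipped and the row is looked up by index.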
![Page 10: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/10.jpg)
Word embeddings are different
• The distributional representation of words has to be trained for specific tasks (e.g., Skip-gram word embeddings are not good for translation)
• Words most similar (according to cosine similarity) to the given words under different word embedding models
http://arxiv.org/pdf/1412.6448.pdf
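Cosine similarity, used above to find similar words, can be sketched like this. The three-dimensional vectors are invented for illustration and the helper names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, embeddings, k=2):
    """Return the k words whose vectors point in the most similar direction."""
    query = embeddings[word]
    scores = {w: cosine_similarity(query, v)
              for w, v in embeddings.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# made-up embeddings, not from any trained model
toy = {
    "king": np.array([0.9, 0.1, 0.0]),
    "queen": np.array([0.8, 0.2, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9]),
}
nearest = most_similar("king", toy, k=1)
```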
![Page 11: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/11.jpg)
Feedforward Neural Net Language Model (NNLM)
• Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
![Page 12: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/12.jpg)
Feedforward Neural Net Language Model (NNLM)
• The same, but with a simpler figure
http://www.cs.cmu.edu/~mfaruqui/talks/nn-clab.pdf
![Page 13: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/13.jpg)
What are NNLMs (supposedly) good at (… that n-gram models never will be)?
![Page 14: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/14.jpg)
Language Modelling and Machine Translation using Neural Networks
Hermann Ney
http://ej.uz/NNLM
![Page 15: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/15.jpg)
Language Modelling
• Conventional Language Modelling
• Measure the quality of an LM with perplexity
• Problem: most of the events are never seen in training data
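Perplexity can be computed from the probabilities a model assigns to each word of a test text. The sketch below assumes those per-word probabilities are already available. It also shows why unseen events are a problem: a single zero probability makes the logarithm undefined, which is why count models need smoothing.

```python
import math

def perplexity(word_probs):
    """Exponential of the average negative log-probability per word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# a model that always spreads its mass evenly over 4 choices
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model is, on average, less "surprised" by the test text.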
![Page 16: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/16.jpg)
Language Modelling Using Neural Networks
• non ANN: count models (Markov chain):
  • limited history of predecessor words
  • smooth relative frequencies
• feedforward multi-layer perceptron (FF MLP):
  • limited history too
  • use predecessor words as input to MLP
• recurrent neural networks (RNN):
  • advantage: unlimited history
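As an illustration of the count-model baseline, here is a bigram model (history of one predecessor word) with add-one smoothing of the relative frequencies. Add-one smoothing is chosen only for brevity and is an assumption; real count models typically use more refined smoothing.

```python
from collections import Counter

def train_bigram(tokens):
    histories = Counter(tokens[:-1])                  # how often each word is a predecessor
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))   # counts of (predecessor, word) pairs
    vocab = set(tokens)

    def prob(prev, word):
        # add-one smoothed relative frequency
        return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))

    return prob

prob = train_bigram("the cat sat on the mat".split())
seen = prob("the", "cat")     # bigram observed in training
unseen = prob("the", "sat")   # never observed, but still gets non-zero probability
```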
![Page 17: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/17.jpg)
![Page 18: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/18.jpg)
Recurrent Neural Net Language Model (RNNLM)
![Page 19: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/19.jpg)
Experiment Results with JRC-ACQUIS EN-LV

| Approach | Perplexity | CPU time | Size |
| --- | --- | --- | --- |
| 5-gram count language model | 48.0376 | 5 minutes + 2 minutes (binarize) | 1118MB |
| 4-gram Feedforward Neural Net Language Model with 2 layers | 126.9841 | 1 week | 43MB |

Settings of the 4-gram Feedforward Neural Net Language Model:
• 1000 word classes;
• batch-size 64;
• learning-rate 5e-3;
• 200 nodes per input word, and a subsequent layer of 200 nodes with a sigmoid activation function.
![Page 20: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/20.jpg)
![Page 21: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/21.jpg)
Practicalities of ANN LM Training (Implementation and Software)
• no regularization, no momentum term, no drop-out (so far!)
• no pre-training (so far!)
• vocabulary reduction: remove singletons, or keep most frequent words
• random initialization of weights: Gaussian of mean 0, variance 0.01
• training criterion: cross-entropy (perplexity)
• stopping: cross-validation, perplexity on a development text
• initial learning rate: typically between 1e-3 and 1e-2
• learning rate: halved when the dev perplexity is worse than the best of previous epochs
• use of mini-batches: 4 to 64
• low level implementation in C++
• GPUs (typically) not used for the results presented in this talk
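The learning-rate rule from the list above can be sketched in Python as follows. The function name and the sequence of dev perplexities are made up for illustration; the actual implementation referred to in the talk is in C++.

```python
def update_learning_rate(lr, dev_perplexity, best_so_far):
    """Halve lr when the dev perplexity is worse (higher) than the best so far."""
    if dev_perplexity > best_so_far:
        return lr / 2.0, best_so_far
    return lr, dev_perplexity

lr, best = 5e-3, float("inf")        # initial rate within the 1e-3 to 1e-2 range
history = []
for ppl in [180.0, 150.0, 155.0, 140.0, 141.0]:   # invented per-epoch dev perplexities
    lr, best = update_learning_rate(lr, ppl, best)
    history.append(lr)
```

In this invented run the rate is halved after the third and fifth epochs, where the dev perplexity rises above the best value seen before.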
![Page 22: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/22.jpg)
Language Modelling Using Neural Networks lab
Exercise sheet: http://ej.uz/NNLMlab
https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
• Data preparation
• Training on small data
• Tuning of hyper-parameters
• Modifying the network architecture
![Page 23: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/23.jpg)
Conventional SMT
![Page 24: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/24.jpg)
Translation Model based on FF MLP
![Page 25: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/25.jpg)
Joint Language and Translation Model based on Feedforward MLP
![Page 26: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/26.jpg)
Experiment Results
![Page 27: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/27.jpg)
Experiment Results
![Page 28: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/28.jpg)
Experiment Results: Syntax-based Multi-System Hybrid Translator + NNLM on JRC-ACQUIS EN-LV

| Approach | BLEU |
| --- | --- |
| MHyT with 5-gram count language model | 22.69 |
| SyMHyT with 5-gram count language model | 24.72 |
| SyMHyT with 4-gram Feedforward Neural Net Language Model with 2 layers | 23.71 |
![Page 29: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/29.jpg)
Machine Translation using Neural Networks lab
• Exercise sheet: http://ej.uz/NNMTlab
• Part 1: N-best Reranking using Neural Network Language Models
  • Obtain the new 1-best hypotheses
  • Measure the translation quality
  • Do reranking and compare to the results obtained before reranking
• Part 2: Neural Network Translation Models
  • Train a unidirectional translation model
  • Train a unidirectional joint model
  • Train a bidirectional translation model
  • Train a bidirectional joint model
  • Try to obtain better perplexity values by changing the batch size and learning rate
• Part 3: N-best Reranking using Neural Network Translation Models
  • Apply rescoring using each of the unidirectional and bidirectional translation and joint models
  • Optimize the model weights with MERT to achieve the best BLEU score on the dev dataset
  • Evaluate the translation hypotheses for each of the rescoring experiments
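The idea behind n-best reranking can be sketched as below. This is a simplified illustration, not the lab's tooling: the scores are invented log-probabilities, the neural model is replaced by a lookup table, and the weight is fixed by hand, whereas the lab tunes the weights with MERT.

```python
def rerank(nbest, nn_score, nn_weight=0.5):
    """nbest: list of (hypothesis, baseline_score); higher scores are better."""
    def combined(item):
        hypothesis, baseline = item
        return baseline + nn_weight * nn_score(hypothesis)
    return sorted(nbest, key=combined, reverse=True)

# invented 2-best list and invented neural LM scores
nbest = [("the house small", -2.0), ("the small house", -2.1)]
toy_nn_scores = {"the house small": -5.0, "the small house": -3.0}

old_best = nbest[0][0]
new_best = rerank(nbest, toy_nn_scores.get)[0][0]
```

Here the second hypothesis overtakes the first once the additional model score is included, which is the "new 1-best hypothesis" step of Part 1.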
![Page 30: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/30.jpg)
Neural Machine Translation
• Encoder-decoder model
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
![Page 31: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/31.jpg)
Neural Machine Translation
![Page 32: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/32.jpg)
Neural Machine Translation
• Encoder-decoder model
  • (a) – encoder
  • (b) – decoder
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
![Page 33: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/33.jpg)
Neural Machine Translation
• Bi-directional recurrent neural networks for attention-based models
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
![Page 34: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/34.jpg)
Neural Machine Translation
• Attention-based models
• Due to its sequential nature, a recurrent neural network tends to remember recent symbols better
• The attention mechanism allows the model to focus, at each time step, on the relevant symbols by selecting the appropriate vector that summarises the input sentence
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
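A minimal sketch of one attention step, assuming NumPy and, as a simplification, dot-product scoring between the decoder state and the source annotation vectors. Attention-based NMT models of this kind usually compute the scores with a small neural network instead; all values below are invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(decoder_state, annotations):
    """annotations: (source_length, dim) vectors, one per source symbol."""
    scores = annotations @ decoder_state   # relevance of each source symbol (assumed scoring)
    weights = softmax(scores)              # attention weights, sum to 1
    context = weights @ annotations        # weighted summary of the input sentence
    return weights, context

annotations = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
weights, context = attend(decoder_state, annotations)
```

A new set of weights is computed at every decoding time step, so the summary vector changes as the translation proceeds.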
![Page 35: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/35.jpg)
Attention-based encoder-decoder NMT
• English-Latvian
• Vocabulary – 30K
• Embedding dimensions – 400
• Hidden layer dimensions – 1,024
• Batch size – 14
• Training corpus – DGT-TM (2,401,815 unique sentences)
• Trained on an NVIDIA GeForce GTX 960 (2GB) GPU
• Training time – ~4 days and 2 hours (crashed due to an out-of-memory exception)
  • But ... luckily it saves models iteratively
• During training uses ~40GB of virtual memory
• Translation time for a 512-sentence test set (this also includes the model loading time)
  • NMT: 19 minutes and 3 seconds (translation with CPU on 6 cores)
  • LetsMT: 1 minute and 39 seconds (translation with CPU on 1 core)
https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2
![Page 36: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/36.jpg)
Attention-based encoder-decoder NMT
• Comparison with LetsMT (LetsMT – 13.93 BLEU, NMT – 12.42 BLEU)
• Translations more fluent (even if not always correct «according to reference»)
![Page 37: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/37.jpg)
Attention-based encoder-decoder NMT
• Unknown words are a problem
![Page 38: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/38.jpg)
Attention-based encoder-decoder NMT
• Sometimes the context around unknown words is surprisingly good
![Page 39: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/39.jpg)
Attention-based encoder-decoder NMT
• Sometimes the NMT creates a translation that is (probably) equally as good as the reference
![Page 40: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/40.jpg)
Attention-based encoder-decoder NMT
• However, sometimes the translation is also bad (total nonsense)
![Page 41: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/41.jpg)
Attention-based encoder-decoder NMT
• Named entities that are listed with commas can cause issues
![Page 42: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/42.jpg)
Attention-based encoder-decoder NMT
• Word alignments
• What is different from Giza++?
• LetsMT translation: šīs paradigma pamatelements ir jaunā struktūra informācijas apstrādes sistēmu .
![Page 43: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/43.jpg)
Attention-based encoder-decoder NMT
• Model trained on a 50K vocabulary and a batch size of 12
• After 300,000 updates (or 3,600,000 observed sentences)
  • I.e., the model is not yet fully trained
  • Results:
    • LetsMT – 13.93 BLEU, NMT – 12.48 BLEU (+0.06)
    • Not good, but it may improve since the model has not finished training…
• After 520,000 updates (6,240,000 observed sentences)
  • LetsMT – 13.93 BLEU, NMT – 11.88 BLEU (-0.54)
![Page 44: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/44.jpg)
Attention-based encoder-decoder NMT
• Lessons learned (from the tiny but long experiments):
  • You need to have a «good» GPU (6GB GDDR may not be enough for systems of a decent vocabulary size)
    • A 2GB card will not allow building models with a vocabulary that is larger than 30-50K
    • The «good» GPUs are expensive (>1K€)
    • Only Nvidia GPUs are currently usable (the existing libraries are built/tuned for CUDA); OpenCL is an alternative that is under-supported
  • If training with a GPU takes up to a week, training with a CPU is a no go
  • 30K is miles away from a decent vocabulary
  • You need to have means to handle unknown words
  • Positive tendencies are evident in the translation quality, but an experiment with a more decent data set and a larger vocabulary is necessary to make more justified judgements
![Page 45: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/45.jpg)
Only
€ 5.979
![Page 46: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/46.jpg)
TensorFlow
https://www.tensorflow.org/
• Deep Flexibility
• True Portability
• Connect Research and Production
• Auto-Differentiation
• Language Options
• Maximize Performance
![Page 47: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/47.jpg)
Theano
http://deeplearning.net/software/theano/
• Built for Python
• Tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
• Transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU (float32 only).
• Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
• Speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
• Dynamic C code generation – Evaluate expressions faster.
• Extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
• The EN-LV NMT model was trained using Theano
• Speed comparison of different NN libraries: https://github.com/soumith/convnet-benchmarks
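The log(1+x) stability point can be illustrated without Theano. The sketch below uses only Python's standard `math` module to show the numerical problem that such an optimization avoids; it does not show how Theano itself performs the rewrite.

```python
import math

x = 1e-18
naive = math.log(1.0 + x)   # 1.0 + x rounds to exactly 1.0 in double precision
stable = math.log1p(x)      # computes log(1 + x) without forming 1 + x
```

The naive expression loses the tiny value entirely, while the dedicated function keeps it.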
![Page 48: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/48.jpg)
![Page 49: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/49.jpg)
![Page 50: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/50.jpg)
![Page 51: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/51.jpg)
![Page 52: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/52.jpg)
![Page 53: Deep Learning for Machine Translation](https://reader034.fdocuments.us/reader034/viewer/2022050614/58808bd21a28ab35718b6b27/html5/thumbnails/53.jpg)