K translate - Baltic DBIS2016
![Page 1: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/1.jpg)
K-Translate
Interactive Multi-System
Machine Translation
Matīss Rikters
12th International Baltic Conference on Databases and Information Systems
Rīga, Latvia
July 5, 2016
![Page 2: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/2.jpg)
Contents
Machine Translation
Hybrid MT
Multi-System Hybrid MT
Combining of translations
Combining full whole translations
Combining translations of sentence chunks
Combining translations of linguistically motivated chunks
Interactive Multi-System Machine Translation
Future plans
![Page 3: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/3.jpg)
Machine Translation
Translation services
Google Translate, Bing Translator, ...
Translation of extensive documents
Localisation
Ebay, Adobe, ...
Fight against terrorism
Speech to speech translation
Skype, ...
![Page 4: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/4.jpg)
Machine Translation
MT
RBMT SMT HMT NMT
![Page 5: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/5.jpg)
Hybrid Machine Translation
Statistical rule generation
Rules for RBMT systems are generated from training corpora
Multi-pass
Process data through RBMT first, and then through SMT
Multi-System hybrid MT
Multiple MT systems run in parallel
![Page 6: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/6.jpg)
Multi-System Hybrid MT
Related work:
SMT + RBMT (Ahsan and Kolachina, 2010)
Confusion Networks (Barrault, 2010)
+ Neural Network Model (Freitag et al., 2015)
SMT + EBMT + TM + NE (Santanu et al., 2014)
Recursive sentence decomposition (Mellebeek et al., 2006)
![Page 7: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/7.jpg)
Combining full whole translations
Translate the full input sentence with multiple MT systems
Choose the best translation as the output
Combining Translations
![Page 8: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/8.jpg)
Combining translations of sentence chunks
Split the sentence into smaller chunks
The chunks are the top level subtrees of the syntax tree of the sentence
Translate each chunk with multiple MT systems
Choose the best translated chunks and combine them
Combining Translations
![Page 9: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/9.jpg)
Combining full whole translations
Sentence tokenization
Translation with the online MT APIs (Google Translate, Bing Translator, LetsMT)
Selection of the best translation
Output
![Page 10: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/10.jpg)
Combining full whole translations
Choosing the best translation:
KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history $w_f^n$:

$$p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})$$

where the probability $p(w_n \mid w_f^{n-1})$ and backoff penalties $b(w_i^{n-1})$ are given by an already-estimated language model. Perplexity is then calculated using this probability:

$$b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b q(x_i)}$$

where, given an unknown probability distribution $p$ and a proposed probability model $q$, the model is evaluated by how well it predicts a separate test sample $x_1, x_2, \ldots, x_N$ drawn from $p$. In the perplexity expression, $b$ denotes the logarithm base, not the backoff penalty.
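The backoff lookup and the perplexity computation described on this slide can be illustrated with a minimal Python sketch. It assumes a toy model stored in two dictionaries of log10 values; the names `backoff_logprob`, `perplexity`, `logprob` and `backoff` are hypothetical and are not part of KenLM.

```python
def backoff_logprob(word, history, logprob, backoff):
    """Return log10 p(word | history) for a toy backoff n-gram model.

    logprob maps n-gram tuples to log10 probabilities; backoff maps
    context tuples to log10 backoff penalties (a missing entry counts as 0).
    Both tables are hypothetical stand-ins for an estimated model.
    """
    context = tuple(history)
    penalty = 0.0
    while True:
        ngram = context + (word,)
        if ngram in logprob:
            # Longest matching history found: its probability plus the
            # penalties collected for every context that failed to match.
            return logprob[ngram] + penalty
        if not context:
            raise KeyError(f"unknown word: {word!r}")
        # Charge the backoff penalty of the context being shortened.
        penalty += backoff.get(context, 0.0)
        context = context[1:]


def perplexity(token_logprobs, base=10.0):
    """Perplexity base ** (-(1/N) * sum of log probabilities) over N tokens."""
    return base ** (-sum(token_logprobs) / len(token_logprobs))


# Toy model: three entries, one backoff penalty.
toy_logprob = {("the",): -1.0, ("cat",): -2.0, ("the", "cat"): -0.5}
toy_backoff = {("the",): -0.3}
seen = backoff_logprob("cat", ["the"], toy_logprob, toy_backoff)
unseen = backoff_logprob("the", ["the"], toy_logprob, toy_backoff)
```

Working in log space turns the product of backoff penalties into a sum, which is why the penalties are added rather than multiplied.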
![Page 11: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/11.jpg)
Combining full whole translations
Choosing the best translation:
A 5-gram language model was trained with
KenLM
JRC-Acquis corpus v. 3.0 (Steinberger, 2006) - 1.4 million Latvian legal
domain sentences
Sentences are scored with the query program that comes with KenLM
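A minimal sketch of the selection step, assuming that the candidate with the lowest language model perplexity is taken as the best translation. The function `select_best` and its `perplexity_of` argument are hypothetical; reporting a tie as "Equal" is also an assumption made for illustration, not a confirmed detail of the original system.

```python
def select_best(candidates, perplexity_of):
    """Pick the translation with the lowest perplexity.

    candidates maps a system name to its translation. perplexity_of is any
    callable that returns a perplexity for a sentence (hypothetical scorer).
    Returns (system, translation); system is "Equal" when all scores tie.
    """
    scored = {name: perplexity_of(text) for name, text in candidates.items()}
    best = min(scored, key=scored.get)
    if len(scored) > 1 and len(set(scored.values())) == 1:
        return "Equal", candidates[best]
    return best, candidates[best]


# Toy scorer standing in for a real language model query.
toy_scores = {"translation a": 12.0, "translation b": 8.5}
choice = select_best(
    {"Google": "translation a", "Bing": "translation b"},
    toy_scores.get,
)
```

If the `kenlm` Python bindings are installed, the `perplexity` method of a loaded `kenlm.Model` could plausibly be passed as `perplexity_of`; the slides themselves only mention the command-line query program.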
![Page 12: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/12.jpg)
Combining full whole translations
Test data
1581 random sentences from the JRC-Acquis corpus
Tested with the ACCURAT balanced evaluation corpus - 512
general domain sentences (Skadiņš et al., 2010), but
the results were not as good
![Page 13: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/13.jpg)
Combining full whole translations
BLEU scores and hybrid selection shares, May 2015 (Rikters, 2015)

| System | BLEU | Google | Bing | LetsMT | Equal |
|---|---|---|---|---|---|
| Google Translate | 16.92 | 100 % | - | - | - |
| Bing Translator | 17.16 | - | 100 % | - | - |
| LetsMT | 28.27 | - | - | 100 % | - |
| Hybrid Google + Bing | 17.28 | 50.09 % | 45.03 % | - | 4.88 % |
| Hybrid Google + LetsMT | 22.89 | 46.17 % | - | 48.39 % | 5.44 % |
| Hybrid LetsMT + Bing | 22.83 | - | 45.35 % | 49.84 % | 4.81 % |
| Hybrid Google + Bing + LetsMT | 21.08 | 28.93 % | 34.31 % | 33.98 % | 2.78 % |
![Page 14: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/14.jpg)
Combining translated chunks of sentences
Sentence tokenization
Syntactic analysis
Sentence chunking
Translation with the online MT APIs (Google Translate, Bing Translator, LetsMT)
Selection of the best chunks
Sentence recomposition
Output
![Page 15: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/15.jpg)
Syntactic analysis:
Berkeley Parser (Petrov et al., 2006)
Sentences are split into chunks from the top level subtrees
of the syntax tree
Combining translated chunks of sentences
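Splitting a sentence at the top-level subtrees of its syntax tree can be sketched as follows. The sketch assumes the parse is available as a Penn-style bracketed string with a labelled root node and no empty outer wrapper; the helper names and the example parse are hypothetical and hand-written, not actual Berkeley Parser output.

```python
def parse_tree(text):
    """Parse a bracketed string into (label, children) tuples.

    Leaves are plain word strings. Assumes well-formed input whose root
    node carries a label, e.g. "(S (NP ...) (VP ...))".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def words(tree):
    """Collect the leaf words of a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def top_level_chunks(tree):
    """Return one chunk per direct child of the root node."""
    return [" ".join(words(child)) for child in tree[1]]


example = ("(S (ADVP (RB Recently)) (NP (EX there)) "
           "(VP (VBZ has) (VBN been) (NP (DT an) (JJ increased) (NN interest))) "
           "(. .))")
chunks = top_level_chunks(parse_tree(example))
```

Because every direct child of the root becomes a chunk, single words and punctuation marks can end up as chunks of their own, while a long verb phrase stays in one piece.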
![Page 16: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/16.jpg)
Selection of the best chunk:
5-gram LM trained with KenLM and the JRC-Acquis corpus
Sentences are scored with the query program that comes with KenLM
Combining translated chunks of sentences
![Page 17: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/17.jpg)
Test data
1581 random sentences from the JRC-Acquis corpus
Tested with the ACCURAT balanced evaluation corpus,
but the results were not as good
Combining translated chunks of sentences
![Page 18: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/18.jpg)
BLEU scores and hybrid selection shares, September 2015 (Rikters and Skadiņa, 2016-1)

| System | BLEU (MSMT / SyMHyT) | Google | Bing | LetsMT |
|---|---|---|---|---|
| Google Translate | 18.09 | 100% | - | - |
| Bing Translator | 18.87 | - | 100% | - |
| LetsMT | 30.28 | - | - | 100% |
| Hybrid Google + Bing | 18.73 / 21.27 | 74% | 26% | - |
| Hybrid Google + LetsMT | 24.50 / 26.24 | 25% | - | 75% |
| Hybrid LetsMT + Bing | 24.66 / 26.63 | - | 24% | 76% |
| Hybrid Google + Bing + LetsMT | 22.69 / 24.72 | 17% | 18% | 65% |
Combining translated chunks of sentences
![Page 19: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/19.jpg)
Combining translations of
linguistically motivated chunks
An advanced approach to chunking
Traverse the syntax tree bottom up, from right to left
Add a word to the current chunk if
The current chunk is not too long (sentence word count / 4)
The word is non-alphabetic or only one symbol long
The word begins with a genitive phrase («of »)
Otherwise, initialize a new chunk with the word
If chunking results in too many chunks, repeat the process, allowing
more than (sentence word count / 4) words in a chunk
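The chunking rules above can be sketched in a heavily simplified form. This sketch scans a flat token list from right to left instead of traversing a syntax tree, treats the three conditions as alternatives, reads the genitive rule as "the token is `of`", and uses a hypothetical `max_chunks` threshold for "too many chunks". All of these are assumptions made for illustration and do not reproduce the author's ChunkMT implementation.

```python
def chunk_words(tokens, divisor=4, max_chunks=4):
    """Group tokens into chunks, scanning right to left.

    A token joins the current chunk when the chunk is still shorter than
    the limit, when the token is non-alphabetic or one character long, or
    when the token is "of"; otherwise it opens a new chunk. If more than
    max_chunks result, the limit is raised by one and the pass is repeated.
    """
    limit = max(1, len(tokens) // divisor)
    while True:
        chunks = []                          # collected right to left
        for token in reversed(tokens):
            current = chunks[-1] if chunks else None
            joins = current is not None and (
                len(current) < limit
                or not token.isalpha()
                or len(token) == 1
                or token.lower() == "of"
            )
            if joins:
                current.insert(0, token)     # prepend to keep word order
            else:
                chunks.append([token])
        if len(chunks) <= max_chunks or limit >= len(tokens):
            return [" ".join(chunk) for chunk in reversed(chunks)]
        limit += 1


sentence = ("Recently there has been an increased interest in the automated "
            "discovery of equivalent expressions in different languages .")
result = chunk_words(sentence.split())
```

Working on a flat token list ignores constituent boundaries, so the chunks from this sketch are less linguistically motivated than those obtained from a syntax tree; it only shows the length limit and the retry loop.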
![Page 20: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/20.jpg)
Changes in the MT API systems
LetsMT API temporarily replaced with Hugo.lv API
Added Yandex API
Combining translations of
linguistically motivated chunks
![Page 21: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/21.jpg)
Combining translations of
linguistically motivated chunks
![Page 22: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/22.jpg)
Selection of the best translation:
6-gram and 12-gram LMs trained with
KenLM
JRC-Acquis corpus v. 3.0
DGT-Translation Memory corpus (Steinberger et al., 2013) – 3.1 million Latvian
legal domain sentences
Sentences scored with the query program from KenLM
Combining translations of
linguistically motivated chunks
![Page 23: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/23.jpg)
Test data
1581 random sentences from the JRC-Acquis corpus
ACCURAT balanced evaluation corpus
Combining translations of
linguistically motivated chunks
![Page 24: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/24.jpg)
Sentence chunks with SyMHyT
• Recently
• there
• has been an increased interest in the automated discovery of equivalent expressions in different languages
• .
Sentence chunks with ChunkMT
• Recently there has been an increased interest
• in the automated discovery of equivalent expressions
• in different languages .
Combining translations of
linguistically motivated chunks
![Page 25: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/25.jpg)
Combining translations of
linguistically motivated chunks
![Page 26: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/26.jpg)
![Page 27: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/27.jpg)
BLEU scores and hybrid selection shares, January 2016 (Rikters and Skadiņa, 2016-2)

| System | BLEU | Equal | Bing | Google | Hugo | Yandex |
|---|---|---|---|---|---|---|
| Single-system BLEU | - | - | 17.43 | 17.73 | 17.14 | 16.04 |
| MSMT - Google + Bing | 17.70 | 7.25% | 43.85% | 48.90% | - | - |
| MSMT - Google + Bing + LetsMT | 17.63 | 3.55% | 33.71% | 30.76% | 31.98% | - |
| SyMHyT - Google + Bing | 17.95 | 4.11% | 19.46% | 76.43% | - | - |
| SyMHyT - Google + Bing + LetsMT | 17.30 | 3.88% | 15.23% | 19.48% | 61.41% | - |
| ChunkMT - Google + Bing | 18.29 | 22.75% | 39.10% | 38.15% | - | - |
| ChunkMT - all four | 19.21 | 7.36% | 30.01% | 19.47% | 32.25% | 10.91% |
Combining translations of
linguistically motivated chunks
![Page 28: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/28.jpg)
• Matīss Rikters and Inguna Skadiņa
"Combining machine translated sentence chunks from multiple MT systems" 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016
• Matīss Rikters and Inguna Skadiņa
"Syntax-based multi-system machine translation" The 10th edition of the Language Resources and Evaluation Conference, 2016
• Matīss Rikters
"Multi-system machine translation using online APIs for English-Latvian" ACL 2015 Fourth Workshop on Hybrid Approaches to Translation, 2015
Related publications
![Page 29: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/29.jpg)
Start page
Translate with online systems
Input translations to combine
Input translated
chunks
Settings
Translation results
Input source sentence
Input source sentence
Interactive multi-system machine translation
![Page 30: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/30.jpg)
Interactive multi-system machine translation
K-translate - interactive multi-system machine translation
About the same as ChunkMT but with a nice user interface
Draws a syntax tree with chunks highlighted
Designates which chunks were chosen from which system
Provides a confidence score for the choices
Allows using online APIs or user provided machine translations
Comes with resources for translating between English, French, German and Latvian
Can be used in a web browser
![Page 31: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/31.jpg)
Translate with
online systems
![Page 32: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/32.jpg)
Translate with
online systems
![Page 33: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/33.jpg)
Input translations
to combine
![Page 34: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/34.jpg)
Input translations
to combine
![Page 35: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/35.jpg)
Input translations
to combine
![Page 36: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/36.jpg)
Input translations
to combine
![Page 37: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/37.jpg)
Code on GitHub
http://ej.uz/KTranslate
http://ej.uz/ChunkMT
http://ej.uz/SyMHyT
http://ej.uz/MSMT
http://ej.uz/chunker
![Page 38: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/38.jpg)
Future work
More enhancements for the chunking step
Add special processing of multi-word expressions (MWEs)
Add support for other types of parsers
SyntaxNet: Neural Models of Syntax (Andor et al., 2016)
MaltParser (Nivre et al., 2006)
Add support for other types of LMs
POS tag + lemma
Recurrent Neural Network Language Model (Mikolov et al., 2010)
Continuous Space Language Model (Schwenk et al., 2006)
Character-Aware Neural Language Model (Kim et al., 2015)
Choose the best translation candidate with MT quality estimation
QuEst++ (Specia et al., 2015)
SHEF-NN (Shah et al., 2015)
Future ideas
![Page 39: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/39.jpg)
References
Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA: The Ninth Conference of the Association for Machine Translation in the Americas. Denver, Colorado (2010).
Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.
Santanu, Pal, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing, 2014.
Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." (2006).
Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058 (2006).
Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
Steinberger, Ralf, et al. "DGT-TM: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226 (2013).
Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 219 (2010): 125-132.
Andor, Daniel, et al. "Globally normalized transition-based neural networks." arXiv preprint arXiv:1603.06042 (2016).
Nivre, Joakim, Johan Hall, and Jens Nilsson. "MaltParser: A data-driven parser-generator for dependency parsing." Proceedings of LREC. Vol. 6. 2006.
Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH. Vol. 2. 2010.
Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL on Main conference poster sessions. Association for Computational Linguistics, 2006.
Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).
Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations. 2015.
Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.
![Page 40: K translate - Baltic DBIS2016](https://reader031.fdocuments.us/reader031/viewer/2022020410/58808bbb1a28ab35718b6acb/html5/thumbnails/40.jpg)