
CS11-737 Multilingual NLP

Data-based Strategies to Low-resource MT

Graham Neubig

Site: http://demo.clab.cs.cmu.edu/11737fa20/

Many slides from: Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.

Data Challenges in Low-resource MT

• MT of high-resource languages (HRLs) with large parallel corpora → relatively good translations

• MT of low-resource languages (LRLs) with small parallel corpora → nonsense!

A Concrete Example

source - Atam balaca boz radiosunda BBC Xəbərlərinə qulaq asırdı.

translation - So I’m going to became a lot of people.

reference - My father was listening to BBC News on his small, gray radio.

A system trained with 5,000 Azerbaijani-English sentence pairs does not convey the correct meaning at all.

Multilingual Training Approaches

● Transfer HRL to LRL (Zoph et al., 2016; Nguyen and Chiang, 2017): train an MT system on HRL-TRG data, then adapt it to LRL-TRG data.

● Joint training with LRL and HRL parallel data (Johnson et al., 2017; Neubig and Hu, 2018): concatenate the LRL-TRG and HRL-TRG data and train a single MT system.

● Problem: suboptimal lexical/syntactic sharing.
● Problem: can't leverage monolingual data.

Data Augmentation

Available resources (notation used in the following slides):

● LRL-TRG parallel data (source LRL, target TRG-L)
● HRL-TRG parallel data (source HRL, target TRG-H)
● Monolingual target-language data (TRG-M)

These resources are combined to produce augmented (pseudo-parallel) LRL-TRG data.

Data Augmentation 101: Back Translation

[Diagram: a TRG→LRL model back-translates monolingual TRG-M data into LRL-M, yielding pseudo-parallel LRL-M/TRG-M pairs alongside the original resources.]

Back Translation Idea

Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." ACL 2016.

1. Train a TRG→LRL system on the LRL-TRG parallel data.
2. Back-translate monolingual TRG-M data into LRL-M, producing pseudo-parallel LRL-M/TRG-M pairs.
3. Train the LRL→TRG system on the real and pseudo-parallel data together.

● Some degree of error in the (back-translated) source data is permissible!
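A minimal sketch of this three-step pipeline. The train_mt() and translate() helpers are hypothetical toy stand-ins for a real NMT toolkit, defined here only so the data flow runs end to end:

# Hypothetical stand-ins for a real NMT toolkit: the toy "model" just
# memorizes training pairs, and decoding copies unknown inputs through.
def train_mt(src_sents, trg_sents):
    return dict(zip(src_sents, trg_sents))

def translate(model, sents):
    return [model.get(s, s) for s in sents]

# Real LRL-TRG parallel data (tiny toy corpus).
lrl_par = ["çox sağ olun"]
trg_par = ["thank you so much"]

# 1. Train a reverse TRG -> LRL system.
trg_to_lrl = train_mt(trg_par, lrl_par)

# 2. Back-translate monolingual TRG-M into pseudo-LRL sources (LRL-M).
trg_mono = ["thank you so much", "good morning"]
lrl_pseudo = translate(trg_to_lrl, trg_mono)

# 3. Train the final LRL -> TRG system on real + pseudo-parallel pairs.
final = train_mt(lrl_par + lrl_pseudo, trg_par + trg_mono)
print(translate(final, ["çox sağ olun"]))  # -> ['thank you so much']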

How to Generate Translations

● Beam search (Sennrich et al. 2016)
  ○ Select the highest-scoring output
  ○ Higher quality, but lower diversity; potential for data bias
● Sampling (Edunov et al. 2018)
  ○ Randomly sample from the back-translation model
  ○ Lower overall quality, but higher diversity
● Sampling has been shown to be more effective overall

Understanding Back-Translation at Scale. Sergey Edunov, Myle Ott, Michael Auli, David Grangier. EMNLP 2018.
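A toy contrast between the two decoding strategies; the next-token distribution below is invented for illustration, where a real system would take it from the NMT decoder:

import random

# Hypothetical next-token distribution from a back-translation model.
next_token_probs = {"cat": 0.6, "kitten": 0.3, "feline": 0.1}

# Beam-/greedy-style: always pick the single highest-probability token.
print(max(next_token_probs, key=next_token_probs.get))  # always "cat"

# Sampling: draw proportionally to probability -> more diverse outputs.
random.seed(0)
print(random.choices(list(next_token_probs),
                     weights=next_token_probs.values(), k=5))
# e.g. ['kitten', 'kitten', 'cat', 'cat', 'cat']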

Iterative Back-translation

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, Trevor Cohn. "Iterative Back-Translation for Neural Machine Translation." WNGT 2018.

1. Train an LRL→TRG system on the parallel data.
2. Forward-translate monolingual LRL data with it.
3. Train a TRG→LRL system, adding the forward-translated pairs.
4. Back-translate monolingual TRG data with the TRG→LRL system.
5. Train the final LRL→TRG system, adding the back-translated pairs.
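A minimal sketch of the loop, reusing the hypothetical toy train_mt()/translate() helpers from the back-translation sketch above (redefined here so the snippet stands alone):

def train_mt(src, trg):        # hypothetical trainer (toy: memorize pairs)
    return dict(zip(src, trg))

def translate(model, sents):   # hypothetical decoder (toy: lookup or copy)
    return [model.get(s, s) for s in sents]

def iterative_back_translation(lrl_par, trg_par, lrl_mono, trg_mono, rounds=2):
    lrl_train, trg_train = list(lrl_par), list(trg_par)
    for _ in range(rounds):
        fwd = train_mt(lrl_train, trg_train)            # 1. LRL -> TRG
        trg_pseudo = translate(fwd, lrl_mono)           # 2. forward-translate
        rev = train_mt(trg_train + trg_pseudo,          # 3. TRG -> LRL
                       lrl_train + list(lrl_mono))
        lrl_pseudo = translate(rev, trg_mono)           # 4. back-translate
        # Rebuild the pool from real data + the freshest synthetic pairs.
        lrl_train = list(lrl_par) + lrl_pseudo
        trg_train = list(trg_par) + list(trg_mono)
    return train_mt(lrl_train, trg_train)               # 5. final system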

Back Translation Issues

• Back-translation fails for low-resource languages or domains. Remedies:

  • Use other high-resource languages

  • Combine with monolingual data (perhaps with denoising objectives, covered in a following class)

  • Perform other varieties of rule-based augmentation

Using HRLs in Augmentation

Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.

English -> HRL Augmentation

● Problem: TRG-LRL back-translation might be low quality
● Idea: also back-translate into the HRL
  ○ more sentence pairs
  ○ vocabulary sharing on the source side
  ○ syntactic similarity on the source side
  ○ improves the target-side LM

Example: back-translating "Thank you very much." (TRG) gives the fluent Turkish "Çok teşekkür ederim." (TUR), while the Azerbaijani back-translation degenerates to "Hə Hə Hə." (AZE).

Available Resources + TRG-LRL and TRG-HRL Back-translation

[Diagram: the resource pool now holds the LRL-TRG and HRL-TRG parallel data plus two kinds of back-translated pairs: LRL-M/TRG-M from the TRG→LRL model and HRL-M/TRG-M from the TRG→HRL model.]

Augmentation via Pivoting

● Problem: HRL-TRG data might suffer from lack of lexical/syntactic overlap
● Idea: translate existing HRL-TRG data
  ○ translate the HRL side from HRL to LRL, yielding pseudo LRL-H/TRG-H pairs

TUR: Çok teşekkür ederim. TRG: Thank you so much.
→ AZE: Çox sağ olun. TRG: Thank you so much.
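A minimal sketch of pivoting, again with the hypothetical toy helpers standing in for a real HRL→LRL system:

def train_mt(src, trg):        # hypothetical trainer (toy: memorize pairs)
    return dict(zip(src, trg))

def translate(model, sents):   # hypothetical decoder (toy: lookup or copy)
    return [model.get(s, s) for s in sents]

# An HRL -> LRL model (here "trained" on a single toy TUR-AZE pair).
hrl_to_lrl = train_mt(["çok teşekkür ederim"], ["çox sağ olun"])

# Existing HRL-TRG parallel data.
hrl_side = ["çok teşekkür ederim"]
trg_side = ["thank you so much"]

# Pivot: translate the HRL side into the LRL, keep the TRG side as-is.
lrl_pseudo = translate(hrl_to_lrl, hrl_side)
print(list(zip(lrl_pseudo, trg_side)))
# [('çox sağ olun', 'thank you so much')]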

Available Resources + TRG-LRL and TRG-HRL Back-translation + Pivoting

[Diagram: the pool additionally holds pseudo LRL-H/TRG-H pairs, produced by translating the HRL side of the HRL-TRG data into the LRL.]

Back-Translation by Pivoting

● Problem: TRG-HRL back-translated data also suffers from lexical or syntactic mismatch
● Idea: translate TRG→HRL→LRL
  ○ a large amount of English monolingual data can be utilized

TRG: Thank you so much. → TUR: Çok teşekkür ederim. → AZE: Çox sağ olun.
(The pivoted AZE sentence is paired with the original TRG sentence.)
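A minimal sketch of the two-step chain, with the same hypothetical toy helpers:

def train_mt(src, trg):        # hypothetical trainer (toy: memorize pairs)
    return dict(zip(src, trg))

def translate(model, sents):   # hypothetical decoder (toy: lookup or copy)
    return [model.get(s, s) for s in sents]

# Toy TRG -> HRL and HRL -> LRL models.
trg_to_hrl = train_mt(["thank you so much"], ["çok teşekkür ederim"])
hrl_to_lrl = train_mt(["çok teşekkür ederim"], ["çox sağ olun"])

# Chain: back-translate monolingual TRG into the HRL, then pivot to the LRL.
trg_mono = ["thank you so much"]
hrl_pseudo = translate(trg_to_hrl, trg_mono)    # TRG -> HRL
lrl_pseudo = translate(hrl_to_lrl, hrl_pseudo)  # HRL -> LRL

# Pair the pivoted LRL text with the original TRG sentences (LRL-MH/TRG-M).
print(list(zip(lrl_pseudo, trg_mono)))
# [('çox sağ olun', 'thank you so much')]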

Data w/ Various Types of Pivoting

[Diagram: the full augmented pool: real LRL-TRG and HRL-TRG data; back-translated LRL-M/TRG-M and HRL-M/TRG-M pairs; pivoted LRL-H/TRG-H pairs; and LRL-MH/TRG-M pairs, where the back-translated HRL-M side is further translated HRL→LRL.]

Monolingual Data Copying

● Problem: back-translation may help with structure, but fail at terminology
● Idea: use monolingual data as-is, copying each TRG sentence to the source side
  ○ helps encourage the model not to drop words
  ○ helps translation of terms that are identical across languages

TRG: Thank you so much. → SRC: Thank you so much. / TRG: Thank you so much.

Anna Currey, Antonio Valerio Miceli Barone, Kenneth Heafield. "Copied Monolingual Data Improves Low-Resource Neural Machine Translation." WMT 2017.
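Copying is trivial to implement; a sketch with toy lists (all names illustrative):

# Pair each monolingual TRG sentence with itself as a pseudo source,
# then append the copied pairs to the real training data.
trg_mono = ["thank you so much", "BBC News"]
copied = [(t, t) for t in trg_mono]  # source = verbatim copy of target

train_src = ["çox sağ olun"] + [s for s, _ in copied]
train_trg = ["thank you so much"] + [t for _, t in copied]
print(train_src, train_trg)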

Heuristic Augmentation Strategies

Dictionary-based Augmentation

1. Find rare words in the source sentences
2. Use a language model to predict another word that could appear in that context
3. Replace the word, and replace the aligned word with its translation from a dictionary

Marzieh Fadaee, Arianna Bisazza, Christof Monz. "Data Augmentation for Low-Resource Neural Machine Translation." ACL 2017.
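A minimal sketch of the idea. The language model is a hypothetical stub, and the dictionary, alignment, and sentences are toy examples, not the actual setup of Fadaee et al.:

# Toy bilingual dictionary for rare words (ENG -> AZE; illustrative only).
rare_dict = {"radio": "radio"}

def lm_propose(left_context, right_context):
    """Hypothetical LM stub: propose a rare word fitting this context."""
    return "radio"

src = "my father was listening to the news".split()
trg = "atam xəbərlərə qulaq asırdı".split()
align = {6: 1}  # toy word alignment: src index -> trg index ("news" ~ "xəbərlərə")

i = 6  # position of the word to replace in the source
proposal = lm_propose(src[:i], src[i + 1:])
if proposal in rare_dict:
    src[i] = proposal                    # substitute the rare word on one side
    trg[align[i]] = rare_dict[proposal]  # dictionary-translate the aligned word
print(" ".join(src), "|||", " ".join(trg))
# my father was listening to the radio ||| atam radio qulaq asırdı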

An Aside: Word Alignment

• Automatically find alignments between source and target words, for dictionary learning, analysis, supervised attention, etc.

• Traditional symbolic methods: word-based translation models trained using the EM algorithm

  • GIZA++: https://github.com/moses-smt/giza-pp

  • FastAlign: https://github.com/clab/fast_align

• Neural methods: use a model like multilingual BERT or a translation model, and find words with similar embeddings

  • SimAlign: https://github.com/cisnlp/simalign
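A usage sketch for SimAlign, following the example in the project's README (treat the exact arguments as illustrative; the API may have changed):

from simalign import SentenceAligner

# Multilingual-BERT-based aligner; "mai" selects three matching
# heuristics, each of which returns its own alignment.
aligner = SentenceAligner(model="bert", token_type="bpe",
                          matching_methods="mai")

src = ["my", "father", "was", "listening", "to", "the", "radio"]
trg = ["atam", "radioya", "qulaq", "asırdı"]

alignments = aligner.get_word_aligns(src, trg)
for method, pairs in alignments.items():
    print(method, pairs)  # pairs are (src_index, trg_index) tuples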

Word-by-word Data Augmentation

• Even simpler: translate sentences word-by-word into the target language using a dictionary

• Problem: what about word ordering and syntactic divergence? (See the sketch below.)

J'ai acheté une nouvelle voiture
I bought a new car

私 は 新しい 車 を 買った
I the new car a bought

Lample, Guillaume, et al. "Unsupervised machine translation using monolingual corpora only." arXiv preprint arXiv:1711.00043 (2017).
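A minimal sketch of naive word-by-word translation using the slide's Japanese example; the toy dictionary reproduces the gloss above, including its bad word order:

# Toy JPN -> ENG dictionary reproducing the slide's gloss.
jpn_to_eng = {"私": "I", "は": "the", "新しい": "new",
              "車": "car", "を": "a", "買った": "bought"}

def word_by_word(tokens, dictionary):
    # Unknown words are copied through unchanged.
    return [dictionary.get(t, t) for t in tokens]

src = ["私", "は", "新しい", "車", "を", "買った"]
print(" ".join(word_by_word(src, jpn_to_eng)))
# -> "I the new car a bought": translated, but still in Japanese order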

Word-by-word Augmentation w/ Reordering

Zhou, Chunting, et al. "Handling Syntactic Divergence in Low-resource Machine Translation." EMNLP 2019.

• Problem: source-target word order can differ significantly in methods that use monolingual pre-training

• Solution: reorder according to grammatical rules, then translate word-by-word to create pseudo-parallel data (a sketch follows)
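A minimal sketch of rule-based reordering before word-by-word translation. The single SVO→SOV rule below is a hypothetical stand-in for the grammar-based rules of Zhou et al.:

def svo_to_sov(tokens, verb_index):
    """Toy reordering rule: move the verb to the end of the clause."""
    return tokens[:verb_index] + tokens[verb_index + 1:] + [tokens[verb_index]]

# Toy ENG -> JPN dictionary (illustrative only).
eng_to_jpn = {"I": "私", "bought": "買った", "a": "を",
              "new": "新しい", "car": "車"}

src = ["I", "bought", "a", "new", "car"]
reordered = svo_to_sov(src, verb_index=1)  # ["I", "a", "new", "car", "bought"]
pseudo = [eng_to_jpn.get(t, t) for t in reordered]
print(" ".join(pseudo))  # closer to Japanese word order than the naive output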

In-class Assignment

• Read one of the cited papers on heuristic data augmentation:

  • Marzieh Fadaee, Arianna Bisazza, Christof Monz. "Data Augmentation for Low-Resource Neural Machine Translation." ACL 2017.

  • Zhou, Chunting, et al. "Handling Syntactic Divergence in Low-resource Machine Translation." EMNLP 2019.

• Try to think of how it would work for one of the languages you're familiar with

• Are there any potential hurdles to applying such a method? Are there any improvements you can think of?