
Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

Challenges in modeling CM language:
• CM is rare in formal text
• Even in the available CM data, switch points are few (~10%)

Can we leverage the readily available monolingual data?

• Code-Mixing (CM) refers to the juxtaposition of linguistic units from two or more languages in a single conversation/utterance

• Language models (LMs) have applications in ASR and Machine Translation

1. Introduction

3. Generating Synthetic Data

6. Experiments and Results

4. Sampling gCM data


• A pair of monolingual sentences can give rise to a large (exponential) number of CM sentences, but only a few are observed in real data

• Even the statistical properties of this generated CM (gCM) data differ from those of real CM data

Expt. ID   Training Curricula      Overall PPL         Avg. SP PPL
                                   Test-17   Test-14   Test-17   Test-14
1          rCM                     2018      1822      5670      8864
2          Mono                    1607      892       23790     26901
3          Mono → rCM              1041      861       4824      7913
4(a)       Mono → gCM
  4(a)-χ   Mono → χ-gCM            1771      1119      5869      6065
  4(a)-↑   Mono → ↑-gCM            1872      1208      9167      8803
  4(a)-ρ   Mono → ρ-gCM            1618      1116      6618      7293
4(b)       gCM → Mono
  4(b)-χ   χ-gCM → Mono            1680      903       21028     20300
  4(b)-↓   ↓-gCM → Mono            1917      973       28722     25006
  4(b)-ρ   ρ-gCM → Mono            1641      871       26710     22557
5(a)       Mono → gCM → rCM
  5(a)-χ   Mono → χ-gCM → rCM      1038      836       4386      5958
  5(a)-↑   Mono → ↑-gCM → rCM      1058      961       5078      6861
  5(a)-ρ   Mono → ρ-gCM → rCM      1011      830       4829      6807
5(b)       gCM → Mono → rCM
  5(b)-χ   χ-gCM → Mono → rCM      1019      790       4987      7018
  5(b)-↓   ↓-gCM → Mono → rCM      1025      800       5489      7476
  5(b)-ρ   ρ-gCM → Mono → rCM      986       772       4912      6547

5. Training Curricula

• An LM can be trained sequentially on different orderings of Mono, gCM and rCM data, resulting in various training curricula (a minimal training sketch follows this list)
• Real CM (rCM) data at the end of training is found to be most effective (Baheti et al. 2017)
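To make the curriculum idea concrete, here is a minimal Python sketch of sequential curriculum training; the stub trainer and the toy corpora are illustrative assumptions, not the actual training setup (which used a neural LM).

# Minimal sketch of curriculum training: the same model is trained on each
# corpus in turn. `train_epochs` is a stub standing in for a real LM training loop.

def train_epochs(model, corpus, epochs):
    """Stub: replace with real LM training (e.g. SGD over an RNN LM)."""
    for _ in range(epochs):
        for sentence in corpus:
            pass  # model.update(sentence) in a real implementation

def train_with_curriculum(model, curriculum, epochs_per_stage=5):
    """Train the model sequentially on each named corpus, in the given order."""
    for name, corpus in curriculum:
        print(f"Stage {name}: {len(corpus)} sentences")
        train_epochs(model, corpus, epochs_per_stage)
    return model

# Curriculum 5(b): gCM -> Mono -> rCM (real CM data last, as in Baheti et al. 2017)
mono = [["she", "will", "solve", "the", "puzzle"]]
gcm  = [["she", "will", "solve", "el", "rompecabezas"]]
rcm  = [["ella", "will", "solve", "the", "puzzle"]]
model = train_with_curriculum(object(), [("gCM", gcm), ("Mono", mono), ("rCM", rcm)])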

Effect of rCM size (PPL vs. # rCM):

Expt.      0.5K   1K     2.5K   5K     10K    50K
3          1238   1186   1120   1041   991    812
5(b)-ρ     1181   1141   1068   986    951    808

• Parsed the En sentences with the Stanford PCFG parser
• Used a pseudo fuzzy-match score to threshold the quality of translations
• The En parse is projected onto the Es sentence using word-level alignments (a minimal projection sketch follows the dataset table below)
• rCM train, validation and test sets: Test-17 (Rijhwani et al. 2017), Test-14 (Solorio et al. 2014)

Dataset      # Tweets   # Words   CMI    SPF
English      100K       850K      0      0
Spanish      100K       860K      0      0
Train        100K       1.4M      0.31   0.105
Validation   100K       1.4M      0.31   0.106
Test-17      83K        1.1M      0.31   0.104
Test-14      13K        138K      0.12   0.06
gCM          31M        463M      0.75   0.35
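As a rough illustration of the projection step described above, the Python sketch below maps an English constituent span onto the Spanish sentence via word-level alignments; the fast_align-style "i-j" alignment format and the min/max span rule are simplifying assumptions, not the exact parse-projection procedure.

# Sketch: project an En token span onto the Es sentence using word-level
# alignments given as fast_align-style "enIdx-esIdx" pairs.

def parse_alignments(align_str):
    """'0-0 1-1 2-1 3-2 4-3' -> [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]"""
    return [tuple(map(int, pair.split("-"))) for pair in align_str.split()]

def project_span(en_span, alignments):
    """Project an En token span [start, end) to the smallest covering Es span."""
    start, end = en_span
    es_indices = [es for en, es in alignments if start <= en < end]
    if not es_indices:
        return None  # unaligned constituent; such spans would be skipped
    return (min(es_indices), max(es_indices) + 1)

en = "She will solve the puzzle".split()
es = "ella resolverá el rompecabezas".split()
alignments = parse_alignments("0-0 1-1 2-1 3-2 4-3")

# Project the En VP "will solve" (token indices 1..2) onto the Es sentence.
es_span = project_span((1, 3), alignments)
print(es[es_span[0]:es_span[1]])   # ['resolverá']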

• Effect of rCM size:
  • As expected, PPL drops with increasing amounts of rCM data
  • gCM data still helps, though diminishingly
  • In general, the baseline (Model 3) needs roughly twice as much rCM data to perform as well as our Model 5(b)-ρ
• Even though gCM helps, rCM data is indispensable
• SPF-based sampling performs the best
• PPL at SPs is much higher than overall PPL, showing the inherent complexity of modeling CM language (a sketch of both metrics follows this list)
• Modeling shorter run lengths is found to be challenging
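To make the two reported metrics concrete, here is a small Python sketch computing overall perplexity and perplexity restricted to switch points from per-token log-probabilities and language tags; treating any token whose language tag differs from the previous token's as a switch point is an assumption about how SP PPL is defined here.

import math

def perplexity(log_probs):
    """exp of the average negative (natural) log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

def switch_point_perplexity(log_probs, lang_tags):
    """Perplexity over tokens whose language differs from the previous token's."""
    sp = [lp for i, lp in enumerate(log_probs)
          if i > 0 and lang_tags[i] != lang_tags[i - 1]]
    return perplexity(sp) if sp else float("nan")

# Toy example: "I am not going to come mañana"
log_probs = [-2.1, -1.8, -2.0, -1.5, -1.2, -2.3, -6.9]   # per-token log P from the LM
lang_tags = ["en", "en", "en", "en", "en", "en", "es"]
print(perplexity(log_probs))                          # overall PPL
print(switch_point_perplexity(log_probs, lang_tags))  # PPL at the en -> es switch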

2. Linguistic Models of Code-Mixing

• Equivalence Constraint Theory (Poplack, 1980; Sankoff, 1998):
  • Any monolingual fragment in a CM sentence must occur in one of the monolingual sentences
  • The CM sentence should not deviate from either of the monolingual grammars
  • The two grammars must be equivalent at switch points (a simplified lattice-generation sketch follows the example below)

Example: "She will solve the puzzle" / "ella resolverá el rompecabezas"

[Figure: original En parse, modified En parse and projected Es parse for this sentence pair; the parallel sentences with word alignments (She–Ella, will solve–resolverá, the–el, puzzle–rompecabezas); and the resulting CM lattice over these aligned fragments]
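A much-simplified sketch of how the lattice yields many candidate CM sentences: each aligned fragment can be realized in either language, so k fragments give up to 2^k candidates (hence the exponential number of CM sentences noted earlier). The fragment list is read off the example lattice; the real procedure applies the EC constraints over full constituency parses, which this sketch omits.

from itertools import product

# Aligned (English, Spanish) fragments from the example lattice.
fragments = [("She", "Ella"), ("will solve", "resolverá"),
             ("the", "el"), ("puzzle", "rompecabezas")]

def generate_cm_candidates(fragments):
    """Realize each aligned fragment in either language: up to 2**k candidates."""
    for choice in product((0, 1), repeat=len(fragments)):
        yield " ".join(frag[lang] for frag, lang in zip(fragments, choice))

candidates = list(generate_cm_candidates(fragments))
print(len(candidates))   # 16 candidates for 4 fragments
print(candidates[1])     # 'She will solve the rompecabezas'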


• Hence the need for sampling, based on:
  • Random (χ-gCM)
  • Code-Mixing Index (CMI) (↑-gCM & ↓-gCM)
  • Switch Point Fraction (SPF) (ρ-gCM)
A sketch of these two metrics and of SPF-based sampling follows this list.
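The Python sketch below computes the two sampling statistics under common formulations — CMI as the fraction of tokens outside the dominant language and SPF as the number of switch points over the number of token boundaries — and sketches ρ-gCM sampling as drawing gCM sentences so that their SPF histogram matches the rCM one; both the exact formulas and the binning scheme are assumptions, not necessarily those used on the poster.

import random
from collections import Counter

def cmi(lang_tags):
    """Code-Mixing Index: fraction of tokens not in the dominant language."""
    counts = Counter(lang_tags)
    return 1.0 - counts.most_common(1)[0][1] / len(lang_tags)

def spf(lang_tags):
    """Switch Point Fraction: switch points per token boundary."""
    switches = sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)
    return switches / max(len(lang_tags) - 1, 1)

def sample_by_spf(gcm_tagged, rcm_spf_values, n, bins=10):
    """ρ-gCM sketch: sample gCM sentences so their SPF histogram matches rCM's.
    Each gCM 'sentence' here is represented just by its list of language tags."""
    bucket = lambda tags: min(int(spf(tags) * bins), bins - 1)
    target = Counter(min(int(v * bins), bins - 1) for v in rcm_spf_values)
    pools = {b: [s for s in gcm_tagged if bucket(s) == b] for b in target}
    sample = []
    for b, count in target.items():
        share = round(n * count / len(rcm_spf_values))
        if pools[b]:
            sample += random.choices(pools[b], k=share)
    return sample

print(cmi(["en", "es", "en", "en"]), spf(["en", "es", "en", "en"]))  # 0.25 0.666...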

Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali

Best model: 5(b)-ρ

[Diagram: ASR pipeline — feature extractor and decoder, the latter combining acoustic, pronunciation and language models]

Example CM utterances:
- I am not going to कम come tomorrow (Hindi कम 'kam')
- este event es un gran exit éxito (Spanish éxito 'success')

Contact: [email protected]

[Diagram: gCM generation pipeline — the original En and Es sentences are paired via the Bing Translator API; Fast Align is run on the parallel data to obtain word-level alignments; the En parse and the alignments are used to generate gCM; the LM is then trained on Mono, gCM and rCM]

[Plot: fractional increase in word frequency in gCM vs. original unigram frequency]

[Plot: number of gCM sentences vs. average input sentence length]

References:
Ashutosh Baheti, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2017. Curriculum design for code-switching: Experiments with language identification and language modeling with deep neural networks. In Proc. of ICON-2017, Kolkata, India, pages 65–74.
Shana Poplack. 1980. Sometimes I'll start a sentence in Spanish y termino en español. Linguistics, 18:581–618.
Shruti Rijhwani, R. Sequiera, M. Choudhury, K. Bali, and C. S. Maddila. 2017. Estimating code-switching on Twitter with a novel generalized word-level language identification technique. In ACL.
David Sankoff. 1998. A formal production-based explanation of the facts of code-switching. Bilingualism: Language and Cognition, 1(01):39–50.
Thamar Solorio et al. 2014. Overview for the first shared task on language identification in code-switched data. In 1st Workshop on Computational Approaches to Code Switching, EMNLP, pages 62–72.