Using Parallel Features in Parsing of Machine-Translated ...
Transcript of Using Parallel Features in Parsing of Machine-Translated ...
![Page 1: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/1.jpg)
Rudolf Rosa, Ondřej Dušek, David Mareček, Martin Popel{rosa,odusek,marecek,popel}@ufal.mff.cuni.cz
Using Parallel Featuresin Parsing
of Machine-Translated Sentencesfor Correction of Grammatical Errors
Charles University in PragueFaculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
SSST, Jeju, 12th July 2012
![Page 2: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/2.jpg)
Parsing of SMT Outputs
can be useful in many applications automatic classification of translation errors automatic correction of translation errors (Depfix) confidence estimation, multilingual question
answering...
✔ we have the source sentence available Can we use it to help parsing?
✗ SMT outputs noisy (errors in fluency, grammar...) parsers trained on gold standard treebanks Can we adapt parser to noisy sentences?
![Page 3: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/3.jpg)
MST Parser
Maximum Spanning Tree dependency parser by Ryan McDonald
![Page 4: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/4.jpg)
(1) Words and Tags
RudolphNNP
relaxesVBZ
abroadRB
#rootwords = nodes
![Page 5: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/5.jpg)
(2) (Nearly) Complete Graph
RudolphNNP
relaxesVBZ
abroadRB
#rootall possible edges =
directed edges
![Page 6: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/6.jpg)
(3) Assign Edge Weights
RudolphNNP
relaxesVBZ
abroadRB
#root
-7 -1659325
1490 1154
-263 24
185
368
Margin InfusedRelaxed Algorithm
(MIRA)
edge weight =
sum ofedge featuresweights
![Page 7: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/7.jpg)
(4) Maximum Spanning Tree
RudolphNNP
relaxesVBZ
abroadRB
#root
-7 -1659325
1490 1154
-263 24
185
368
non-projective trees:Chu-Liu-Edmonds algorithm
(projective trees:
Eisner algorithm)
![Page 8: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/8.jpg)
(5) Unlabeled Dependency Tree
RudolphNNP
relaxesVBZ
abroadRB
#rootdependency tree =
maximumspanning tree
![Page 9: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/9.jpg)
(6) Labeled Dependency Tree
RudolphNNP
relaxesVBZ
abroadRB
#root
Predicate
Subject Adverbial
labels asignedby a second stagelabeler
![Page 10: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/10.jpg)
RUR Parser
reimplementation of MST Parser (so far only) first-order, non-projective
adapted for SMT outputs parsing parallel features ”worsening” the training treebank
![Page 11: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/11.jpg)
English-to-Czech SMT
Czech language highly flective
4 genders, 2 numbers, 7 cases, 3 persons... Czech grammar requires agreement in related words
word order relatively free: word order errors not crucial
Phrase-Based SMT often makes inflection errors:➔ Rudolph's car is black.
✗ Rudolfova/fem auto/neut je černý/masc.
✔ Rudolfovo/neut auto/neut je černé/neut.
![Page 12: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/12.jpg)
Parser Training Data
Prague Czech-English Dependency Treebank parallel treebank 50k sentences, 1.2M words morphological tags, surface syntax, deep syntax word alignment
![Page 13: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/13.jpg)
Parallel Features
word alignment (using GIZA++) additional features (if aligned node exists):
aligned tag (NNS, VBD...) aligned dependency label (Subject, Attribute...) aligned edge existence (0/1)
![Page 14: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/14.jpg)
Parallel Features Example
RudolfNN M S 1
relaxujeVB S 3
vRR 6
zahraničíNN N S 6
RudolphNNP
relaxesVBZ
abroadRB
#root
#root
Pred
AuxP
AdvSubj
Subj
Pred
Adv
![Page 15: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/15.jpg)
Worsening the Treebank
treebank used for training contains correct sentences
SMT output is noisy grammatical errors incorrect word order missing/superfluous words …
let's introduce similar errors into the treebank! so far, we have only tried inflection errors
![Page 16: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/16.jpg)
Worsen (1): Apply SMT
translate English side of PCEDT to Czech by an SMT system (we used Moses)
now we have (e.g.): Gold English
Rudolph's car is black. Gold Czech
RudolfovoNEUT
autoNEUT
je černéNEUT
.
SMT Czech Rudolfova
FEM auto
NEUT je černý
MASC.
![Page 17: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/17.jpg)
Worsen (2): Align SMT to Gold
align SMT Czech to Gold Czech Monolingual Greedy Aligner
alignment link score = linear combination of: similarity of word forms (or lemmas) similarity of morphological tags (fine-grained) similarity of positions in the sentence indication whether preceding/following words aligned
repeat: align best scoring pair until below threshold no training: weights and threshold set manually
![Page 18: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/18.jpg)
Worsen (3): Create Error Model
for each tag: estimate probabilities of SMT system using an
incorrect tag instead of the correct tag(Maximum Likelihood Estimate)
Czech tagset: fine-grained morphological tags part-of-speech, gender, number, case, person,
tense, voice... 1500 different tags in training data
![Page 19: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/19.jpg)
Worsen (3): Error Model
Adjective, Masculine, Plural, Instrumental case(AAMP7), e.g. lingvistickými (linguistic)➔ 0.2 Adjective, Masculine, Singular, Nominative case
➔ e.g. lingvistický➔ 0.1 Adjective, Masculine, Plural, Nominative case
➔ e.g. lingvističtí➔ 0.1 Adjective, Neuter, Singular, Accusative case
➔ e.g. lingvistické
… altogether 2000 such change rules
![Page 20: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/20.jpg)
Worsen (4): Apply Error Model
take Gold Czech for each word:
assign a new tag randomly sampled according to Tag Error Model
generate a new word form rule-based generator, generates even unseen forms new_form = generate_form(lemma, tag) || old_form
→ get Worsened Czech use resulting Gold English-Worsened Czech
parallel treebank to train the parser
![Page 21: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/21.jpg)
Direct Evaluation by Inspection
manual inspection of several parse trees comparing baseline and adapted parser ouputs
examples of improvements: subject identification even if not in nominative case adjective-noun dependence identification even if
agreement violated (gender, number, case)
hard to do reliably trying to find a correct parse tree for an (often)
incorrect sentence – not well defined
![Page 22: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/22.jpg)
Indirect Evaluation: in Depfix
rule-based grammar correction of SMT outputs input = aligned, tagged and parsed sentences:
target (Czech) sentence – to be corrected source (English) sentence – additional information
applies 20 correction rules: noun – adjective agreement (gender, number, case) subject – predicate agreement (gender, number) preposition – noun agreement (case) …
![Page 23: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/23.jpg)
Depfix: Rudolph's Car
RudolphNNP
carNN
RudolfovaAA fem
autoNN neutAtr Atr
'sPOS
RudolfovoAA neut
autoNN neut
Atr
Atr
Adjective – Noun Agreement
![Page 24: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/24.jpg)
Indirect Evaluation Results
differences in Depfix corrections evaluated by humans: better / worse / indefinite
three different parsers RUR + parallel features + worsened treebank – original McDonald's MST Parser RUR – our baseline setup
RUR + parallel features + worsened treebankbetter worse indefinite51% 30% 18%
RUR 54% 28% 18%
![Page 25: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/25.jpg)
Conclusion
SMT outputs often hard to parse RUR parser – adapted to parsing SMT outputs
parallel features (tag, dep. label, edge existence) worsening the training treebank (tag error model)
outputs of English-to-Czech translation evaluated in Depfix
SMT errors correction system
![Page 26: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/26.jpg)
Future Work
more sophisticated parallel features more experiments on worsening more languages
parallel tagging
![Page 27: Using Parallel Features in Parsing of Machine-Translated ...](https://reader031.fdocuments.us/reader031/viewer/2022021623/620affe1d7585039816dac63/html5/thumbnails/27.jpg)
Thank you for your attention
For this presentation and other information, visit:
http://ufal.mff.cuni.cz/~rosa/depfix/
Rudolf Rosa, Ondřej Dušek, David Mareček, Martin Popel{rosa,odusek,marecek,popel}@ufal.mff.cuni.cz
Charles University in PragueFaculty of Mathematics and Physics
Institute of Formal and Applied Linguistics