CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 18 – Alignment in SMT and Tutorial on Giza++ and Moses)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
15th Feb, 2011
Going forward from word alignment
Word alignment → Phrase alignment (going to bigger units of correspondence) → Decoding (best possible translation)
Abstract Problem
Given: e0 e1 e2 e3 … en en+1 (Entities)
Goal: l0 l1 l2 l3 … ln ln+1 (Labels)
The goal is to find the best possible label sequence:
L* = argmax_L P(L | E)
Generative Model
argmax_L P(L | E) = argmax_L P(L) · P(E | L)
Simplification
Using the Markov assumption, the language model can be represented using bigrams:
P(L) = Π_{i=0}^{n} P(l_{i+1} | l_i)
Similarly, the translation model can be represented in the following way:
P(E | L) = Π_{i=0}^{n} P(e_i | l_i)
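As a concrete illustration of the decomposition argmax_L P(L) · P(E | L), here is a minimal sketch that scores every candidate label sequence with a bigram language model and a per-word translation model. All entities, labels, and probabilities below are invented toy values, not from any trained model:

```python
from itertools import product

# Toy bigram language model P(l_{i+1} | l_i) and translation model P(e_i | l_i).
# "^" is the start-of-sequence marker; all numbers are hypothetical.
bigram_lm = {("^", "N"): 0.6, ("^", "V"): 0.4,
             ("N", "V"): 0.7, ("N", "N"): 0.3,
             ("V", "N"): 0.3, ("V", "V"): 0.3}
trans = {("dog", "N"): 0.8, ("dog", "V"): 0.1,
         ("barks", "N"): 0.2, ("barks", "V"): 0.7}

def score(labels, entities):
    """P(L) * P(E|L) under the bigram + word-independence decomposition."""
    p = 1.0
    prev = "^"
    for e, l in zip(entities, labels):
        p *= bigram_lm.get((prev, l), 1e-9) * trans.get((e, l), 1e-9)
        prev = l
    return p

entities = ["dog", "barks"]
best = max(product("NV", repeat=2), key=lambda L: score(L, entities))
print(best)  # the argmax label sequence over all candidates
```

Brute-force enumeration over label sequences is only feasible for toy sizes; the Viterbi algorithm later in the lecture does the same argmax efficiently.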
Statistical Machine Translation
Finding the best possible English sentence given the foreign sentence:
E* = argmax_E P(E | F) = argmax_E P(E) · P(F | E)
P(E) = Language Model
P(F | E) = Translation Model
E: English, F: Foreign Language
Problems in the framework
Labels are words of the target language, and they are very large in number.
Example (preposition stranding):
Who do you want to_go with ?
With whom do you want to go ?
आप किस के_साथ जाना चाहते_हो (Aap kis ke_sath jaana chahate_ho)
Place a column of candidate target-language words on each source-language word:
^ Aap kis ke_sath jaana chahate_ho .
(under each source word: who, do, you, want, to_go, with, and so on)
Find the best possible path from '^' to '.' using transition and observation probabilities.
Viterbi can be used.
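A minimal Viterbi sketch over such a lattice, applied to the Hindi example above. The transition and observation probabilities here are hypothetical (one-hot emissions and uniform transitions, so the example stays tiny); a real model would have soft distributions over many candidates per column:

```python
def viterbi(obs, states, trans, emit, start="^"):
    """Best label path for obs, using transition and observation probabilities."""
    # V[s] = probability of the best path ending in state s at the current position.
    V = {s: trans.get((start, s), 0.0) * emit.get((s, obs[0]), 0.0) for s in states}
    back = []  # backpointers, one dict per position after the first
    for o in obs[1:]:
        step, V_new = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[p] * trans.get((p, s), 0.0))
            step[s] = prev
            V_new[s] = V[prev] * trans.get((prev, s), 0.0) * emit.get((s, o), 0.0)
        back.append(step)
        V = V_new
    path = [max(states, key=V.get)]
    for step in reversed(back):       # follow backpointers to recover the path
        path.append(step[path[-1]])
    return path[::-1]

states = ["who", "do", "you", "want", "to_go", "with"]
obs = ["Aap", "kis", "ke_sath", "jaana", "chahate_ho"]
# One-hot observation probabilities P(source word | target word) -- hypothetical.
emit = {("you", "Aap"): 1.0, ("who", "kis"): 1.0, ("with", "ke_sath"): 1.0,
        ("to_go", "jaana"): 1.0, ("want", "chahate_ho"): 1.0}
# Uniform transitions, including from the start marker "^".
trans = {(a, b): 0.2 for a in states + ["^"] for b in states}
print(viterbi(obs, states, trans, emit))
```

With one-hot emissions the best path simply picks each word's translation in source order; real transition probabilities are what let the decoder reorder and disambiguate.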
TUTORIAL ON Giza++ and Moses tools (delivered by Kushal Ladha)
Word-based alignment
For each word in the source language, align words from the target language that this word possibly produces.
Based on IBM models 1–5:
Model 1 – simplest
As we go from models 1 to 5, the models get more complex but more realistic.
This is all that Giza++ does.
Alignment
A function from target position to source position.
The alignment sequence is: 2, 3, 4, 5, 6, 6, 6
Alignment function A: A(1) = 2, A(2) = 3, …
A different alignment function will give the sequence 1, 2, 1, 2, 3, 4, 3, 4 for A(1), A(2), …
To allow spurious insertion, allow alignment with word 0 (NULL).
No. of possible alignments: (I+1)^J
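The (I+1)^J count follows because each of the J target positions independently picks one of the I source positions or NULL (position 0). A brute-force check with tiny invented lengths:

```python
from itertools import product

I, J = 2, 3  # toy source length I and target length J
# Each of the J target positions picks a source position in 0..I (0 = NULL),
# so an alignment is a J-tuple over I+1 choices.
alignments = list(product(range(I + 1), repeat=J))
print(len(alignments))  # equals (I+1)**J
```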
IBM Model 1: Generative Process
Training Alignment Models
Given a parallel corpus, for each (F, E) learn the best alignment A and the component probabilities:
t(f|e) for Model 1
lexicon probability P(f|e) and alignment probability P(a_i | a_{i-1}, I)
How to compute these probabilities if all you have is a parallel corpus?
Intuition: Interdependence of Probabilities
If you knew which words are probable translations of each other, then you could guess which alignment is probable and which one is improbable.
If you were given alignments with probabilities, then you could compute translation probabilities.
Looks like a chicken-and-egg problem.
The EM algorithm comes to the rescue.
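The EM loop for Model 1's t(f|e) can be sketched in a few lines. The three-sentence parallel corpus below is invented for illustration, t(f|e) is initialized uniformly, and the NULL word is omitted for brevity:

```python
from collections import defaultdict

# Invented toy parallel corpus: (English words, foreign words).
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]

f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # t(f|e), uniform initialization

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for es, fs in corpus:
        for f in fs:
            # E-step: distribute one count for f over the e's it may align to,
            # in proportion to the current t(f|e).
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e) in count:
        t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("livre", "book")], 3), round(t[("le", "book")], 3))
```

Because "livre" co-occurs with "book" in both sentences that contain either word, the expected counts reinforce that pair on every iteration, resolving the chicken-and-egg circularity without ever seeing a gold alignment.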
Limitation: only 1→many alignments allowed.
Phrase-based alignment
More natural
Many-to-one mappings allowed
Giza++ and Moses Package
http://cl.naist.jp/~eric-n/ubuntu-nlp/
Select your Ubuntu version
Browse the nlp folder
Download the Debian packages of giza++, moses, mkcls, srilm
Resolve all the dependencies and they get installed
For alternate installation, refer to http://www.statmt.org/moses_steps.html
Steps
Input – sentence-aligned parallel corpus
Output – target-side tagged data
Training
Tuning
Generate output on test corpus (decoding)
Training
Create a folder named corpus containing the test, train and tuning files.
Giza++ is used to generate the alignment.
The phrase table is generated after training.
Before training, a language model needs to be built on the target side:

mkdir lm; /usr/bin/ngram-count -order 3 -interpolate -kndiscount -text $PWD/corpus/train_surface.hi -lm lm/train.lm;
/usr/share/moses/scripts/training/train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0;
Example (grapheme-to-phoneme parallel data)

train.en                     train.pr
h e l l o                    hh ah l ow
h e l l o                    hh eh l ow
w o r l d                    w er l d
c o m p o u n d w o r d      k aa m p aw n d w er d
h y p h e n a t e d          hh ay f ah n ey t ih d
o n e                        ow eh n iy
o n e                        ow eh n iy
b o o m                      b uw m
k w e e z l e b o t t e r    k w iy z l ah b aa t ah r
Sample from Phrase-table

b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718
b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
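Each phrase-table line consists of |||-separated fields. A small parsing sketch, assuming the layout shown in this sample (source phrase, target phrase, two alignment fields, then the score column):

```python
# One line in the style of the sample phrase table above.
line = "l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718"

# Split on the ||| separator and strip the surrounding spaces.
src, tgt, align_st, align_ts, score_str = (f.strip() for f in line.split("|||"))
scores = [float(s) for s in score_str.split()]

print(src, "->", tgt)
print(scores)
```

The exact number and meaning of fields varies across Moses versions; splitting on "|||" and treating the last field as the score vector is the part that stays stable.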
Tuning
Not a compulsory step, but it will improve the decoding by a small percentage.

mkdir tuning; cp $WDIR/corpus/tun.en tuning/input; cp $WDIR/corpus/tun.hi tuning/reference;
/usr/share/moses/scripts/training/mert-moses.pl $PWD/tuning/input $PWD/tuning/reference /usr/bin/moses $PWD/model/moses.ini --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts

It will take around 1 hour on a server with 32GB RAM.
Testing
mkdir evaluation; /usr/bin/moses -config $WDIR/tuning/moses.ini -input-file $WDIR/corpus/test.en >evaluation/test.output;
The output will be in the evaluation/test.output file.
Sample output:
h o t        hh aa t
h i          h|UNK hh
p h o n e    p|UNK hh ow eh n iy
b o o k      b uw k