Post on 15-Jan-2016
Spoken Language Translation
Intelligent Robot Lecture Note
• Spoken language translation (SLT) directly translates spoken utterances into another language.
• Major components
  – Automatic Speech Recognition (ASR)
  – Machine Translation (MT)
  – Text-to-Speech (TTS)
[Figure: SLT pipeline. Source speech → ASR → source sentence ("버스 정류장이 어디에 있나요?") → MT → target sentence ("Where is the bus stop?") → TTS → target speech.]
• In comparison with written language, speech, and especially spontaneous speech, poses additional difficulties for the task of automatic translation.
  – Typically, these difficulties are caused by errors in the speech recognition step, which is carried out before the translation process.
  – As a result, the sentence to be translated is not necessarily well-formed from a syntactic point of view.
  – Even without recognition errors, the structures of spontaneous speech differ from those of written language.
• Why a statistical approach for machine translation?
  – The statistical approach avoids hard decisions at any level of the translation process.
  – For any source sentence, a translated sentence in the target language is guaranteed to be generated.
Coupling ASR to MT
• Motivation
  – ASR cannot guarantee error-free output
    – The 1-best hypothesis of ASR could be wrong
    – SLT must be designed to be robust to speech recognition errors
  – MT could benefit from the wide range of supplementary information provided by ASR
  – MT quality may depend on the WER of ASR
    – Strong correlation between recognition and translation quality
    – The best achievable WER decreases when a set of hypotheses is considered instead of the 1-best only
    – Idea: exploit more transcriptions
• SLT systems vary in the degree to which SMT and ASR are integrated within the overall translation process.
• Loose coupling
  – SMT uses the ASR output (1-best, N-best, lattice, or confusion network) as input, with one-way module communication
• Tight coupling
  – The whole search space of ASR and MT is integrated
[Figure: Loose coupling: source speech → ASR → 1-best, N-best, lattice, or CN → SMT → target sentence → TTS → target speech. Tight coupling: source speech → ASR + SMT → target sentence → TTS → target speech.]
• Statistical spoken language translation
  – Given a speech input $x_1^T$ in the source language, find the best translation $\hat{e}_1^{\hat{I}}$:
$$\hat{e}_1^{\hat{I}} = \arg\max_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T)$$
$$\approx \arg\max_{I,\, e_1^I} \max_{f_1^J \in F(o)} \Pr(f_1^J, e_1^I \mid x_1^T)$$
  – F(o) is a set of possible transcriptions
    – Loose coupling: 1-best, N-best, lattice, or confusion network
    – Tight coupling: full search space
  – Pr(f, e | x): speech translation model
    – Acoustic and translation features
• Loose coupling vs. tight coupling

| | Loose coupling | Tight coupling |
|---|---|---|
| Modularity of knowledge sources (KS) | Each KS in a stand-alone module | All KSs integrated in a single model |
| Inter-module communication | Typically one-way (pipelined) | N/A |
| Scalability | Easy | Not easy |
| Complexity | Feasible | Feasible only for very small domains |
ASR Outputs
• Automatic speech recognition (ASR) is a process by which an acoustic speech signal is converted into a sequence of words.
• Architecture
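[Figure: ASR architecture. Speech signals → feature extraction → decoding → ASR outputs (1-best, N-best, lattice, or CN). The decoder runs on a network constructed from the acoustic model (HMM estimation from a speech DB), the pronunciation model (G2P), and the language model (LM estimation from text corpora).]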
• Network structure
[Figure: hierarchical HMM network. Sentence HMM: a network over the words ONE, TWO, THREE. Word HMM: ONE as the phone sequence W AH N. Phone HMM: W with states 1, 2, 3.]
• Decoding of HMM-based ASR
  – Searching for the best path in a huge HMM-state lattice
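The best-path search mentioned for HMM decoding can be sketched with the Viterbi algorithm. This is a minimal illustration, not the decoder of any particular ASR system: the three states reuse the phone labels W, AH, N from the figure, while the discrete observation symbols `a`, `b`, `c` and all probabilities are made-up assumptions.

```python
import math

def lg(p):
    """Log of a probability, with log(0) mapped to -inf."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(observations, states, start, trans, emit):
    """Return (log score, state path) of the best path through the state lattice."""
    # best[t][s]: best log score of any path ending in state s at frame t
    best = [{s: lg(start[s]) + lg(emit[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + lg(trans[p][s]))
            best[t][s] = best[t - 1][prev] + lg(trans[prev][s]) + lg(emit[s][observations[t]])
            back[t][s] = prev
    # Backtracking: follow the stored pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return best[-1][last], path[::-1]

# Hypothetical left-to-right model for the word ONE (W AH N).
states = ["W", "AH", "N"]
start = {"W": 1.0, "AH": 0.0, "N": 0.0}
trans = {
    "W": {"W": 0.5, "AH": 0.5, "N": 0.0},
    "AH": {"W": 0.0, "AH": 0.5, "N": 0.5},
    "N": {"W": 0.0, "AH": 0.0, "N": 1.0},
}
emit = {
    "W": {"a": 0.8, "b": 0.1, "c": 0.1},
    "AH": {"a": 0.1, "b": 0.8, "c": 0.1},
    "N": {"a": 0.1, "b": 0.1, "c": 0.8},
}
score, path = viterbi(["a", "a", "b", "c"], states, start, trans, emit)
print(path, math.exp(score))
```

The `back` table is the backtracking pointer table: it stores one predecessor per state and frame, which is why keeping it for full state sequences becomes expensive in large networks.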
• 1-best
  – The best path is found by backtracking
  – Why a 1-best "word" sequence?
    – Storing the backtracking pointer table for the full state sequence takes a lot of memory
    – Usually a backtrack pointer stores only the previous words before the current word
• N-best
  – Trace back not only from the 1st-best path, but also from the 2nd-best, 3rd-best, etc.
  – Methods
    – Directly from the search backtrack pointer table: exact N-best algorithm, word-pair N-best algorithm, A* search using the Viterbi score as heuristic
    – Generate a lattice first, then generate the N-best list from the lattice
• Lattice
  – A word-based lattice
    – A compact representation of the state lattice
    – Only word nodes are involved
  – From the decoding backtracking pointer table
    – Record only the links between word nodes
  – From an N-best list
    – Becomes a compact representation of the N-best list
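A word lattice and the "generate a lattice first, then the N-best list" route can be sketched as follows. The lattice, its words, and its log-probability scores are hypothetical. The enumeration is exhaustive, which is only reasonable for a toy lattice; a real system would use a k-best path algorithm instead.

```python
import heapq

# Hypothetical toy lattice: node -> list of (next_node, word, log-probability).
LATTICE = {
    0: [(1, "where", -0.2), (1, "wear", -1.8)],
    1: [(2, "is", -0.1)],
    2: [(3, "the", -0.3), (3, "a", -1.4)],
    3: [(4, "bus", -0.2), (4, "boss", -1.7)],
    4: [(5, "stop", -0.1)],
    5: [],
}

def all_paths(lattice, node, final):
    """Yield (score, words) for every path from node to final."""
    if node == final:
        yield 0.0, []
        return
    for nxt, word, score in lattice[node]:
        for rest_score, rest_words in all_paths(lattice, nxt, final):
            yield score + rest_score, [word] + rest_words

def n_best(lattice, start, final, n):
    """Return the n highest-scoring paths, best first."""
    return heapq.nlargest(n, all_paths(lattice, start, final), key=lambda p: p[0])

for score, words in n_best(LATTICE, 0, 5, 3):
    print(round(score, 2), " ".join(words))
```

Six arcs at the word level encode eight distinct sentences here, which illustrates why the lattice is a more compact representation than the corresponding N-best list.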
• Confusion network (L. Mangu et al., 2000)
  – Also called "sausage network" or "consensus network"
  – A weighted directed graph with a start node, an end node, and word labels over its edges
  – Each path from the start node to the end node goes through all the other nodes
  – Built from a lattice by multiple alignment
Loose Coupling : 1-best
• The best hypothesis produced by the ASR system is passed as text to the MT system.
  – Baseline
  – Simple structure
  – Fast translation
• The speech recognition module and the translation module run largely independently
  – Lacks joint optimality
• No use of multiple transcriptions
  – Supplementary information easily available from the ASR system is not exploited in the translation process
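• Structure
[Figure: source speech $x_1^T$ → ASR → 1-best $\hat{f}_1^{\hat{J}}$ → SMT → target sentence $\hat{e}_1^{\hat{I}}$ → TTS → target speech]
  – Recognition
$$\hat{f}_1^{\hat{J}} = \arg\max_{J,\, f_1^J} P(x_1^T \mid f_1^J)\, P(f_1^J)$$
  – Translation
$$\hat{e}_1^{\hat{I}} = \arg\max_{I,\, e_1^I} P(\hat{f}_1^{\hat{J}} \mid e_1^I)\, P(e_1^I)$$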
Loose Coupling : N-best
• N hypotheses are translated by a text MT decoder and re-ranked according to ASR & SMT scores (R. Zhang et al., 2004)
• Structure
[Figure: source speech $x_1^T$ → ASR → N-best list $\hat{f}_1^{\hat{J}}[1], \ldots, \hat{f}_1^{\hat{J}}[N]$ → SMT → N×M translations $\hat{e}_1^{\hat{I}}[n,1], \ldots, \hat{e}_1^{\hat{I}}[n,M]$ for $n = 1, \ldots, N$ → Rescore → best translation $\hat{e}_1^{\hat{I}}$]
• ASR module
  – Generates the N-best speech recognition hypotheses
  – $\hat{f}_1^{\hat{J}}[n]$: n-th best speech recognition hypothesis
• SMT module
  – Generates the M-best translation hypotheses for each recognition hypothesis
  – $\hat{e}_1^{\hat{I}}[n,m]$: m-th best translation hypothesis produced from $\hat{f}_1^{\hat{J}}[n]$
• Rescore module
  – Rescores all N×M translations
  – Key component: a log-linear model
  – Features derived from ASR and SMT are combined in this module to rescore translation candidates.
• Rescore: log-linear models
$$P(E \mid X) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E)\right)}{\sum_{E'} \exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E')\right)}$$
$$\hat{E} = \arg\max_{E} \sum_{m=1}^{M} \lambda_m \log P_m(X, E)$$
  – $E'$: all possible translation hypotheses
  – $f_m(X, E) = \log P_m(X, E)$: m-th feature in log value
    – ASR features: acoustic model, source language model
    – SMT features: target language model, phrase translation model, distortion model, length model, …
  – $\lambda_m$: weight of each feature
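The rescoring step can be illustrated with a small sketch. Because the normalization term does not depend on the candidate, picking the best translation only requires the weighted sum of log feature values. The candidate list, feature names, and numbers below are invented for illustration and do not come from the cited system.

```python
def rescore(candidates, weights):
    """Return the candidate with the highest weighted sum of log feature values."""
    def total(candidate):
        return sum(weights[name] * value for name, value in candidate["features"].items())
    return max(candidates, key=total)

# Hypothetical N x M candidates (here N = 2 ASR hypotheses, M = 1 translation each).
candidates = [
    {"asr_rank": 1, "translation": "where is the boss stop",
     "features": {"acoustic": -10.0, "source_lm": -6.0, "translation": -4.0, "target_lm": -9.0}},
    {"asr_rank": 2, "translation": "where is the bus stop",
     "features": {"acoustic": -10.5, "source_lm": -6.2, "translation": -4.1, "target_lm": -5.0}},
]

all_features = {"acoustic": 1.0, "source_lm": 1.0, "translation": 1.0, "target_lm": 1.0}
asr_only = {"acoustic": 1.0, "source_lm": 1.0, "translation": 0.0, "target_lm": 0.0}

print(rescore(candidates, all_features)["translation"])
print(rescore(candidates, asr_only)["translation"])
```

With ASR features alone the first recognition hypothesis wins, while combining ASR and SMT features selects the translation of the second hypothesis, which is the effect the rescore module is meant to capture.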
• Parameter optimization (F.J. Och, 2003)
  – Objective function
$$\lambda_1^M = \operatorname{optimize}\; D(E_s, R_s)$$
    – $E_s$: translation output after log-linear model rescoring
    – $R_s$: English reference sentences
    – $D(E_s, R_s)$: automatic translation quality metrics
      – BLEU: a weighted geometric mean of the n-gram matches between test and reference sentences plus a short sentence penalty
      – NIST: an arithmetic mean of the n-gram matches between test and reference sentences
      – mWER: multiple reference word error rate
      – mPER: multiple reference position independent word error rate
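The BLEU description (geometric mean of n-gram matches plus a short sentence penalty) can be made concrete with a simplified sentence-level sketch. Standard BLEU is computed over a whole test corpus and is usually smoothed or aggregated differently; the function below is an assumption-laden simplification with uniform n-gram weights.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU with uniform weights over 1..max_n grams."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Short sentence (brevity) penalty against the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("where is the bus", ["where is the bus stop"], max_n=2))
```

In the example every unigram and bigram of the candidate matches, so the score is determined by the penalty for being one word shorter than the reference.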
• Parameter optimization : Direction Set Methods
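[Figure: flowchart of the direction set method. Change the initial lambda → local optimization, repeated with a change of direction → local lambda → best lambda, each evaluated with $D(E_s, R_s)$.]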
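A heavily simplified version of direction-set optimization is sketched below: the search directions are fixed to the coordinate axes, and each line search is a scan over a small grid of values. Powell's method proper also updates the direction set, and the quality function here is an average of per-sentence scores rather than a corpus-level metric. The development set, its features, and its quality values are hypothetical.

```python
def pick(nbest, weights):
    """Index of the candidate with the highest weighted feature sum."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats, _ in nbest]
    return scores.index(max(scores))

def corpus_quality(dev_set, weights):
    """Average quality of the candidates selected under the given weights."""
    return sum(nbest[pick(nbest, weights)][1] for nbest in dev_set) / len(dev_set)

def direction_set_search(dev_set, weights, grid, sweeps=5):
    """Optimize one weight at a time by a grid line search, keeping the others fixed."""
    weights = list(weights)
    best = corpus_quality(dev_set, weights)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):          # change direction
            for value in grid:                 # local optimization along it
                trial = weights[:i] + [value] + weights[i + 1:]
                quality = corpus_quality(dev_set, trial)
                if quality > best:
                    best, weights, improved = quality, trial, True
        if not improved:
            break
    return weights, best

# Each sentence: list of (feature vector, quality of that candidate).
dev_set = [
    [((-1.0, -5.0), 0.2), ((-2.0, -1.0), 0.9)],
    [((-1.0, -4.0), 0.3), ((-1.5, -2.0), 0.8)],
]
weights, quality = direction_set_search(dev_set, [1.0, 0.0], grid=[0.0, 0.5, 1.0, 2.0])
print(weights, quality)
```

Starting several searches from different initial weights and keeping the best local optimum corresponds to the "change initial lambda" loop in the flowchart.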
Loose Coupling : Lattice
• Lattice-based MT
  – Input
    – Word lattices produced by the ASR system
  – Directly integrate all models in the decoding process
    – Phrase-based lexica, single-word-based lexica, recognition features
  – Problem
    – How to translate the word lattices?
• Approach
  – Joint probability approach
    – WFST (E. Matusov et al., 2005)
  – Phrase-based approach
    – Log-linear model (E. Matusov et al., 2005)
    – WFST (L. Mathias et al., 2006)
• Structure
[Figure: source speech $x_1^T$ → ASR → word lattice over source hypotheses $\hat{f}_1^{\hat{J}}$ → Rescore → best translation $\hat{e}_1^{\hat{I}}$]
• From the derived decision rule
  – $\Pr(x_1^T \mid f_1^J)$: standard acoustic model
  – $\Pr(e_1^I)$: target language model
  – $\Pr(f_1^J \mid e_1^I)$: translation model
• Source language model?
  – To take into account the requirement for well-formedness of the source sentence, the translation model has to include a context dependency on the previous source words
  – This dependency for the whole sentence can be approximated by including a source language model
Loose Coupling : Lattice (Joint Probability Approach : WFST)
• Joint probability approach
  – The conditional probability terms $\Pr(f_1^J \mid e_1^I)$ and $\Pr(e_1^I)$ can be rewritten when using a joint probability translation model $\Pr(f_1^J, e_1^I)$:
$$\hat{e}_1^{\hat{I}} = \arg\max_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T)$$
$$\approx \arg\max_{I,\, e_1^I} \left\{ \Pr(e_1^I) \max_{J,\, f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) \right\}$$
$$= \arg\max_{I,\, e_1^I} \max_{J,\, f_1^J} \Pr(f_1^J, e_1^I)\, \Pr(x_1^T \mid f_1^J)$$
  – This simplifies coupling the systems
  – The joint probability translation model can be used instead of the usual LM in ASR
• WFST-based joint probability system
  – The joint probability MT system is implemented with WFSTs
  – First, the training corpus is transformed based on a word alignment
    – Example: vorrei|I’d_like del|some gelato|ice_cream per|ε favore|please
  – Then, a statistical m-gram model is trained on the bilingual corpus
  – This language model is represented as a finite-state transducer, which is the final translation model
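The corpus transformation can be sketched as follows, assuming a monotone alignment in which every source word carries zero or more target words. The function names and the input format are hypothetical, and the cited system handles alignments that this sketch does not cover. Bigram counts stand in for the m-gram model trained on the transformed corpus.

```python
from collections import Counter

EPS = "ε"

def to_bilingual_tokens(source_words, aligned_targets):
    """Pair each source word with its aligned target words, joined by '_' (ε if none)."""
    tokens = []
    for word, targets in zip(source_words, aligned_targets):
        tokens.append(f"{word}|{'_'.join(targets) if targets else EPS}")
    return tokens

def bigram_counts(corpus):
    """Count bigrams over bilingual token sequences, with sentence boundary markers."""
    counts = Counter()
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

source = ["vorrei", "del", "gelato", "per", "favore"]
alignment = [["I'd", "like"], ["some"], ["ice", "cream"], [], ["please"]]
tokens = to_bilingual_tokens(source, alignment)
counts = bigram_counts([tokens])
print(" ".join(tokens))
```

Because each token carries both a source and a target side, an m-gram model over these tokens is a joint model of the sentence pair and can be read as a transducer from source words to target words.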
• WFST-based joint probability system (cont'd)
  – Searching for the best target sentence is done in the composition of the input, represented as a WFST, and the translation transducer
  – Coupling the FSA system with ASR is simple
    – The output of the ASR, represented as a WFST, can be used directly as input to the MT search
  – Features
    – Only the acoustic and translation probabilities
    – The source LM scores are not included
    – The joint m-gram translation probability serves as a source LM
Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Probability distributions are represented as features in a log-linear model
  – The translation model probability is decomposed into several probabilities
  – Acoustic model and source language model probabilities are also included
  – For a hypothesized recognized source sentence $f_1^J$ and a hypothesized translation $e_1^I$, let $k \to (j_k, i_k)$, $k = 1, \ldots, K$ be a monotone segmentation of the sentence pair into K bilingual phrases
• Features
  – The m-gram target language model
  – The phrasal lexicon models
    – The phrase translation probabilities are computed as a log-linear interpolation of the relative frequencies
  – The single-word-based lexicon models
• Features (cont'd)
  – $c_1$, $c_2$: word and phrase penalty features
  – The recognition model
    – The acoustic model probability
    – The m-gram source language model probability
• Optimization
  – All features are scaled with a set of exponents $\lambda = \{\lambda_1, \ldots, \lambda_7\}$ and $\mu = \{\mu_1, \mu_2\}$
  – The scaling factors are optimized iteratively in a minimum error training framework by performing 100 to 200 translations of a development set
  – The criterion: WER, BLEU, mWER, mPER
• Practical aspects of lattice translation
  – Generation of word lattices
    – In a first step, all entities that are not spoken words are mapped onto the empty arc label ε
    – The time information is not used, so it is removed from the lattices
    – The structure is compressed by applying ε-removal, determinization, and minimization
    – This step significantly reduces runtime without changing the results
  – Phrase extraction
    – The number of different phrase pairs is very large
    – Candidate phrase pairs have to be kept in main memory
    – In the case of ASR word lattice input, the lattice for each test utterance is traversed, and only phrases that match sequences of arcs in the lattice are extracted
    – Thus only phrases that can be used in translation are loaded
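The lattice-driven phrase filtering can be sketched by collecting every word sequence that matches consecutive arcs and keeping only phrase-table entries whose source side is among them. The lattice, the phrase table, and the placeholder target strings are hypothetical, and the maximum phrase length of 3 is an arbitrary assumption.

```python
def lattice_ngrams(lattice, max_len):
    """Collect every word sequence of up to max_len consecutive arcs in the lattice."""
    found = set()

    def extend(node, prefix):
        for nxt, word in lattice.get(node, []):
            phrase = prefix + (word,)
            found.add(phrase)
            if len(phrase) < max_len:
                extend(nxt, phrase)

    for node in lattice:
        extend(node, ())
    return found

def filter_phrase_table(phrase_table, lattice, max_len=3):
    """Keep only entries whose source phrase matches a sequence of lattice arcs."""
    usable = lattice_ngrams(lattice, max_len)
    return {src: tgt for src, tgt in phrase_table.items() if tuple(src.split()) in usable}

# Hypothetical lattice: node -> list of (next_node, word).
lattice = {
    0: [(1, "where"), (1, "wear")],
    1: [(2, "is")],
    2: [(3, "the")],
    3: [(4, "bus"), (4, "boss")],
    4: [(5, "stop")],
}
# Hypothetical phrase table: source phrase -> placeholder target phrases.
phrase_table = {
    "where is": ["TGT_A"],
    "bus stop": ["TGT_B"],
    "the boss": ["TGT_C"],
    "train station": ["TGT_D"],
    "is a": ["TGT_E"],
}
filtered = filter_phrase_table(phrase_table, lattice)
print(sorted(filtered))
```

Only the entries that can actually be matched in this utterance's lattice survive, which is what keeps the in-memory phrase table small.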
• Practical aspects of lattice translation (cont'd)
  – Pruning
    – A high-density word lattice as input → an enormous search space → pruning is necessary
    – Coverage pruning and histogram pruning, based on the total costs of a hypothesis
    – It may also be necessary to prune the input word lattices
• Advantages
  – The utilization of multiple features
  – The direct optimization for an objective error measure
• Disadvantages
  – A less efficient search
  – Heavy pruning is unavoidable
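One way the two pruning steps could interact is sketched below. It assumes that coverage pruning compares hypotheses covering the same number of source positions and drops those whose cost exceeds the best one by more than a threshold, and that histogram pruning then caps the number of survivors. Both the interpretation and the hypothesis format are assumptions, not details from the cited system.

```python
from collections import defaultdict

def prune(hypotheses, threshold, max_hyps):
    """Threshold pruning within each coverage group, then a global histogram cap."""
    groups = defaultdict(list)
    for hyp in hypotheses:
        groups[hyp["coverage"]].append(hyp)
    survivors = []
    for group in groups.values():
        best = min(h["cost"] for h in group)
        survivors.extend(h for h in group if h["cost"] <= best + threshold)
    survivors.sort(key=lambda h: h["cost"])   # lower total cost is better
    return survivors[:max_hyps]

# Hypothetical partial hypotheses; coverage = number of covered source positions.
hypotheses = [
    {"cost": 2.0, "coverage": 3, "text": "where is"},
    {"cost": 2.5, "coverage": 3, "text": "where was"},
    {"cost": 6.0, "coverage": 3, "text": "wear is"},
    {"cost": 4.0, "coverage": 4, "text": "where is the"},
    {"cost": 4.2, "coverage": 4, "text": "where is a"},
]
kept = prune(hypotheses, threshold=1.0, max_hyps=3)
print([h["text"] for h in kept])
```

Tighter thresholds and smaller caps speed up the search but raise the risk of discarding the hypothesis that would have led to the best translation.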
Loose Coupling : Lattice (Phrase-based Approach : WFST)
• Statistical modeling for text translation
  – Ω: all foreign phrase sequences that could have generated the foreign text
  – The translation system effectively translates phrase sequences rather than word sequences
  – This is done by first mapping the sentence into all its phrase sequences
• The phrase sequence lattice contains the phrase sequences that can be extracted from the text
  – All phrase sequences correspond to the unique foreign sentence
  – Here, a phrase is a sequence of words which can be translated
  – Different phrase sequences lead to different translations
  – The lattice is unweighted
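The mapping of a sentence into all its phrase sequences can be sketched as an enumeration of segmentations against a phrase inventory. The English sentence and the inventory are hypothetical stand-ins for the foreign text and the translatable phrases; the cited work represents the same set compactly as a lattice rather than as an explicit list.

```python
def phrase_sequences(words, phrases):
    """Return every segmentation of `words` into consecutive phrases from `phrases`."""
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        if head in phrases:
            for rest in phrase_sequences(words[end:], phrases):
                results.append([head] + rest)
    return results

# Hypothetical inventory of translatable phrases.
inventory = {"where", "is", "where is", "the", "bus", "stop", "bus stop", "the bus stop"}
sequences = phrase_sequences("where is the bus stop".split(), inventory)
for sequence in sequences:
    print(" | ".join(sequence))
```

Every sequence spells out the same sentence, so no weights are needed to distinguish them at this stage; they differ only in how the translation model will treat them.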
• Statistical modeling for speech translation
  – The target phrase mapping transducer is applied to the foreign language ASR word lattice
  – L·Ω: the likely foreign phrase sequences that could have generated the foreign speech
  – The translation system still effectively translates phrase sequences rather than word sequences
  – These are extracted from the ASR lattice, with ASR scores, rather than from a text sentence
• The phrase sequence lattice contains the phrase sequences that can be extracted from the ASR word lattice
  – Phrase sequences correspond to the translatable word sequences in the lattice
  – The lattice contains weights from the ASR system
  – Translating this foreign phrase lattice is MAP translation of the foreign speech under the generative model
• Spoken language translation is recast as an ASR analysis problem in which the goal is to extract translatable foreign language phrases from ASR word lattices
  – Step 1. Perform foreign language ASR to generate a foreign language word lattice L
  – Step 2. Analyze the foreign language word lattice and extract the phrases to be translated
  – Step 3. Build the target language phrase mapping transducer Ω
  – Step 4. Compose L and Ω to create the foreign language ASR phrase lattice L·Ω
  – Step 5. Translate the foreign language phrase lattice
• ASR and MT must be very compatible for this approach
Loose Coupling : Confusion Network
• CN-based decoder (N. Bertoldi et al., 2005)
  – Input: a confusion network represented as a matrix
  – Text vs. CN
    – Text: 나 는 소년 입니다 . ("I am a boy.")
    – CN (one column per position, each word with its probability):
      1. 나 1.0
      2. 는 0.7, 은 0.3
      3. 소녀 0.6, 소년 0.4
      4. 입니다 0.5, 입니까 0.3, 합니다 0.2
      5. . 0.8, ? 0.2
  – Problem: how to translate confusion network input?
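The matrix view of a confusion network can be sketched with the example columns from the slide. The helper names are hypothetical; the path probability is taken as the product of the per-column word probabilities, which is the usual reading of a confusion network.

```python
from math import prod

# One list per column; each entry is (word, probability), as in the slide example.
CN = [
    [("나", 1.0)],
    [("는", 0.7), ("은", 0.3)],
    [("소녀", 0.6), ("소년", 0.4)],
    [("입니다", 0.5), ("입니까", 0.3), ("합니다", 0.2)],
    [(".", 0.8), ("?", 0.2)],
]

def consensus(cn):
    """Pick the most probable word in every column."""
    return [max(column, key=lambda wp: wp[1])[0] for column in cn]

def path_probability(cn, words):
    """Product of the column probabilities of the chosen words."""
    return prod(dict(column)[word] for column, word in zip(cn, words))

def num_paths(cn):
    """Number of distinct word sequences encoded by the network."""
    return prod(len(column) for column in cn)

print(consensus(CN), num_paths(CN))
print(path_probability(CN, ["나", "는", "소년", "입니다", "."]))
```

Five columns with at most three entries each encode 24 sentences, and the text sentence from the slide is one of them even though it is not the consensus path, which is why translating the whole network can recover from a wrong 1-best.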
• Solution
  – Simple!
  – A CN-based SLT decoder can be developed starting from a phrase-based SMT decoder
  – The CN-based SLT decoder is substantially the same as the phrase-based SMT decoder apart from the way the input is managed
• Comparison with N-best methods
  – N-best decoder: does not take advantage of overlaps among the N-best hypotheses
  – CN decoder: exploits overlaps among hypotheses
• Phrase-based translation model
  – Phrase: a sequence of consecutive words
  – Alignment: a map between the CN and target phrases, with one word per column aligned with a target phrase
  – Search criterion: the model searched over is a log-linear phrase-based model
• Log-linear phrase-based translation model
  – The conditional distribution is determined through suitable real-valued feature functions and takes the log-linear parametric form, i.e. the normalized exponential of a weighted sum of the feature functions
  – Feature functions
    – Language model
    – Fertility models
    – Distortion models
    – Lexicon model
    – Likelihood of the path within the CN
    – True length of the path
• Step-wise translation process
  – Translation is performed with a step-wise process
  – Each step translates a sub-CN and produces a target phrase
  – The process starts with an empty translation
  – After each step, we get a partial translation
  – A partial translation is complete if the whole input CN is translated
• Complexity reduction
  – Recombining theories
  – Beam search
  – Reordering constraints
  – Lexicon pruning
  – Confusion network pruning
• Algorithms
• Step-wise translation process
Loose Coupling
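| | 1-best | N-best | Lattice | CN |
|---|---|---|---|---|
| Multiple hypotheses? | X | O | O | O |
| ASR features into MT decoding? | X | X | O | O |
| Overlaps among hypotheses? | X | O | O | X |
| Approximation for word lattice? | X | O | X | O |

(O = yes, X = no)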
Tight Coupling
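• Theory (H. Ney, 1999)
$$
\begin{aligned}
\hat{e}_1^{\hat{I}} &= \arg\max_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T) \\
&= \arg\max_{I,\, e_1^I} \Pr(e_1^I)\, \Pr(x_1^T \mid e_1^I) && \text{(Bayes' rule)} \\
&= \arg\max_{I,\, e_1^I} \Pr(e_1^I) \sum_{J,\, f_1^J} \Pr(f_1^J, x_1^T \mid e_1^I) && \text{(introduce } f \text{ as hidden variable)} \\
&= \arg\max_{I,\, e_1^I} \Pr(e_1^I) \sum_{J,\, f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J, e_1^I) && \text{(Bayes' rule)} \\
&= \arg\max_{I,\, e_1^I} \Pr(e_1^I) \sum_{J,\, f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) && \text{(assume } x \text{ doesn't depend on target language)} \\
&\approx \arg\max_{I,\, e_1^I} \Pr(e_1^I) \max_{J,\, f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) && \text{(sum to max)}
\end{aligned}
$$
  – Three factors
    – Pr(e): target language model
    – Pr(f|e): translation model
    – Pr(x|f): acoustic model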
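A toy calculation can show why the integrated decision rule may disagree with a 1-best pipeline. All probabilities below are invented, with two source hypotheses `f1`, `f2` and two target sentences `e1`, `e2`. The pipeline commits to the best recognition result before translating, while the integrated rule maximizes the product of target LM, translation model, and acoustic model jointly.

```python
acoustic = {"f1": 0.6, "f2": 0.4}        # Pr(x|f), hypothetical values
source_lm = {"f1": 0.5, "f2": 0.5}       # Pr(f), used only by the pipeline
translation = {                          # Pr(f|e), hypothetical values
    ("f1", "e1"): 0.8, ("f2", "e1"): 0.1,
    ("f1", "e2"): 0.1, ("f2", "e2"): 0.9,
}
target_lm = {"e1": 0.3, "e2": 0.7}       # Pr(e), hypothetical values

def pipeline():
    """Loose coupling with 1-best: recognize first, then translate the single result."""
    f_hat = max(acoustic, key=lambda f: acoustic[f] * source_lm[f])
    return max(target_lm, key=lambda e: translation[(f_hat, e)] * target_lm[e])

def integrated():
    """Tight coupling: argmax over e of Pr(e) * max over f of Pr(f|e) * Pr(x|f)."""
    return max(
        target_lm,
        key=lambda e: target_lm[e] * max(translation[(f, e)] * acoustic[f] for f in acoustic),
    )

print(pipeline(), integrated())
```

Here the pipeline returns `e1` because it has already discarded `f2`, whereas the integrated rule prefers `e2`, whose overall score is higher once the slightly less likely transcription is taken into account.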
• ASR vs. tight coupling (SLT)
[Figure: ASR combines an acoustic model with a source LM; SLT combines an acoustic model with a translation model and a target LM.]
  – Brute force method
    – Instead of incorporating the LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e)
    – Very complicated
    – Not feasible
• WFST-based joint probability system (full integration)
  – The ASR search network
    – A composition of WFSTs, $H \circ C \circ L \circ G$
    – H: the HMM topology, C: the context dependency, L: the lexicon, G: the LM
  – Only the source LM G needs to be replaced by the translation model Tr
  – Speech translation search network ST
    – Built from H, C, L, and Tr by composition, with determinization (det) applied to the intermediate results
  – Result
    – Small improvement in translation quality
    – But very slow
• BLEU scores against lattice density (S. Saleem et al., 2004)
  – Improvements from tighter coupling may only be observed when ASR lattices are sparse, i.e. when there are only a few hypothesized words per spoken word in the lattice
  – This would mean that a fully integrated speech translation system would not work at all
• Possible issues of tight coupling
  – In ASR, the source n-gram LM is already very close to the best configuration
  – The complexity of the algorithm is too high; approximation is still necessary to make it work
  – The current approaches still have not really implemented tight coupling
• Conclusion
  – The approach seems to be haunted by the very high complexity of search algorithm construction
Reading List
• L. Mangu, E. Brill, A. Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language 14(4), 373-400.
• V. H. Quan, M. Federico, M. Cettolo. 2005. Integrated N-best Re-ranking for Spoken Language Translation. EuroSpeech.
• R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong, and W. K. Lo. 2004. A unified approach in speech-to-speech translation: Integrating features of speech recognition and machine translation. In Proc. of Coling 2004.
• F.J. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL.
• E. Matusov, S. Kanthak, and H. Ney. 2005. On the Integration of Speech Recognition and Statistical Machine Translation. in Proc. Interspeech 2005.
• E. Matusov, H. Ney, R. Schluter. 2005. Phrase-based Translation of Speech Recognizer Word Lattices Using Loglinear Model Combination. ASRU 2005.
• E. Matusov, H. Ney, R. Schluter. 2006. Integrating Speech Recognition and Machine Translation: Where Do We Stand. ICASSP 2006.
• L. Mathias, W. Byrne. 2006. Statistical Phrase-based Speech Translation. ICASSP 2006.
• N. Bertoldi, M. Federico. 2005. A new decoder for spoken language translation based on confusion networks. In IEEE ASRU Workshop.
• H. Ney. 1999. Speech translation: Coupling of recognition and translation. In Proc. ICASSP.
• S. Saleem, S. C. Jou, S. Vogel, and T. Schultz. 2004. Using word lattice information for a tighter coupling in speech translation systems. In Proc. ICSLP 2004.