Statistical Translation Language Model
description
Transcript of Statistical Translation Language Model
Statistical Translation Language Model
Maryam [email protected]
University of Illinois at Urbana-Champaign1
2
Outline• Motivation & Background
– Language model (LM) for IR – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual– Motivation– IBM Model 1
• Statistical Translation Language Model – Monolingual– Synthetic Queries– Mutual Information-based approach– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model
The Basic LM Approach([Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99])
Document
Text miningpaper
Food nutritionpaper
Language Model
…text ?mining ?assocation ?clustering ?…food ?…
…food ?nutrition ?healthy ?diet ?…
Query = “data mining algorithms”
? Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
d1
d2
dN
qd1
d2
dN
Doc LM
p(q| d1)
p(q| d2)
p(q| dN)
Query likelihood
Retrieval as LM Estimation
• Document ranking based on query likelihood | |
1 1
1 2
log ( | ) log ( | ) ( , ) log ( | )
, ...
Vm
i i ii i
m
p q d p q d c w q p w d
where q q q q
• Retrieval problem Estimation of p(wi|d)• Smoothing is an important issue, and
distinguishes different approaches
Document language model
6
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood Estimator– P(w|d) = relative frequency of word w in d– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing
P(w)
w
Max. Likelihood Estimate
wordsallofcountwofcount
ML wp )(
Smoothed LM
Smoothing Methods for IR
• Method 1(Linear interpolation, Jelinek-Mercer):
• Method 2 (Dirichlet Prior/Bayesian):
( , )( | ) (1 ) ( | )| |c w dp w d p w REFd
parameterML estimate
( ; ) ( | ) | || | | | | |
( , )( | ) ( | )| |
c w d p w REF dd d d
c w dp w d p w REFd
parameter
(Zhai & Lafferty 01)
9
Outline• Motivation & Background
– Language model (LM) for IR – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual– Motivation– IBM Model 1
• Statistical Translation Language Model – Monolingual– Synthetic Queries– Mutual Information-based approach– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model
10
A Brief History
• Machine translation was one of the first applications envisioned for computers
• Warren Weaver (1949): “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.”
• First demonstrated by IBM in 1954 with a basic word-for-word translation system
11
Interest in Machine Translation
• Commercial interest:– U.S. has invested in MT for intelligence
purposes– MT is popular on the web—it is the most used
of Google’s special features– EU spends more than $1 billion on translation
costs each year.– (Semi-)automated translation could lead to
huge savings
12
Interest in Machine Translation
• Academic interest:– One of the most challenging problems in NLP
research– Requires knowledge from many NLP sub-areas, e.g.,
lexical semantics, parsing, morphological analysis, statistical modeling,…
– Being able to establish links between two languages allows for transferring resources from one language to another
13
Word-Level Alignments
• Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:
Machine Translation -- Concepts• We are trying to model P(e|f)
– I give you a French sentence– You give me back English
• How are we going to model this?– The maximum likelihood estimation of P(e | f)
is: freq(e,f)/freq(f).– Way too specific to get any reasonable
frequencies! Vast majority of unseen data will have zero counts!
Machine Translation – Alternative way
• We could use Bayes rule
• Why using Bayes rule and not directly estimating p(e|f) ?
)()|()(
)()|()|( ePefPfPePefPfeP
)()|(maxarg ePefPee
It is important that our model for p(e|f) concentrates its probability as much as possible on well-formed English sentences. But it is not important that our model for P(f|e) concentrate its probability on well-formed French sentences.
Given a French sentence f, we could do a search for an e that maximizes p(e|f).
16
Statistical Machine Translation• The noisy channel model
– Assumptions:• An English word can be aligned with multiple French words
while each French word is aligned with at most one English word
• Independence of the individual word-to-word translations
Language Model Translation Model Decoder eP
e efP
f fePe emaxargˆ
e
e: English f: French
jf
kf
jae
kae
|e|=l |f|=m
17
Estimation of Probabilities -- IBM Model 1
• Simplest of the IBM models. (There are 5 models)
• Does not consider word order (bag-of-words approach)
• Does not model one-to-many alignments• Computationally inexpensive• Useful for parameter estimations that are
passed on to more elaborate models
18
IBM Model 1• Three important components involved
– Language model• Give the probability p(e).
– Translation model • Estimate the Translation Probability p(f|e).
– Decoder
efpepfpefpep
fePeeeemaxargmaxargmaxargˆ
19
IBM Model 1- Translation Model
• Joint probability of P(F=f, E=e, A=a) where A is an alignment between two sentences.
• Assume each French word has exactly one connection.
20
IBM Model 1- Translation Model• Assume, |e|=l and |f|=m, then the alignment can be
represented by a series a = a1a2…am
• Each alignment is between 0 and l such that if the word in position j of the French sentence is connected to the word in position i of the English sentence, then aj=i and if it not connected to any English word, then aj=0.
• The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l.
21
IBM Model 1 – Translation Model
all possible alignments(the English word that a French
word fj is aligned with)translation probability
𝑃 ( 𝑓 |𝑒 )=∑𝑎𝑝( 𝑓 ,𝑎∨𝑒)
𝑝 ( 𝑓 ,𝑎|𝑒 )= 1(𝑙+1)𝑚
∏𝑗=1
𝑚
𝑝 ( 𝑓 𝑗∨𝑒𝑎 𝑗¿)¿
EM algorithm is used to estimate the translation probabilities.
22
Outline• Motivation & Background
– Language model (LM) for IR – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual– Motivation– IBM Model 1
• Statistical Translation Language Model – Monolingual– Synthetic Queries– Mutual Information-based approach– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model
The Problem of Vocabulary Gap Query = auto wash
autowash
…
carwash
vehicle
d1
autobuy…
auto
d2
d3
P(“auto”) P(“wash”)
P(“auto”) P(“wash”)
How to support inexact matching? {“car” , “vehicle”} == “auto” “buy” ==== “wash”
23
Translation Language Models for IR [Berger & Lafferty 99]
Query = auto wash
autowash
…
carwash
vehicle
d1
autobuyauto
d2
d3
“auto”
“car”
“translate” “auto”
Query = car wash
P(“auto”) P(“wash”) “car”
“auto”
P(“car”|d3)
Pt(“auto”| “car”)
“vehicle” P(“vehicle”|d3) Pt(“auto”| “vehicle”)
P(“auto” |d3)= p(“car”|d3) x pt(“auto”| “car”) + p(“vehicle”|d3) x pt(“auto”| “vehicle”)
)|()|()|( uwpdupdwp tu
ml
How to estimate?
24
• When relevance judgments are available, (q,d) serves as data to train the translation model
• Without relevance judgments, we can use synthetic data [Berger & Lafferty 99], <title, body>[Jin et al. 02]
Basic translation model
Translation model Regular doc LM
dw
tt dupuwpdwp )|()|()|(
Estimation of Translation Model: pt(w|u)
• Select words that are representative of a document.
• Calculate a Mutual information for each word in a document: I(w,d) = p(w,d)log
• Synthetic queries are sampled based on normalized mutual information.
• The resulting (d,q) of documents and synthetic queries are used to estimate the probabilities using EM algorithm (IBM Model 1).
Estimation of Translation Model – Synthetic Queries
([Berger & Lafferty 99])
Estimation of Translation Model – Synthetic Queries Algorithm
([Berger & Lafferty 99])
Training data
Limitations: 1.Can’t translate into words not seen
in the training queries 2.Computational complexity
A simpler and more efficient method for estimating pt(w|u) with higher coverage
was proposed in:
M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval.
ACM SIGIR, pages 323-330, 2010
28
Estimation of Translation Model Based on Mutual Information
1. Calculate Mutual information for each pair of two words in the collection (measuring co-occurrences)
2. Normalize mutual information score to obtain a translation probability:
}1,0{ }1,0{ )()(
),(log),();(w uX X uw
uwuwuw XpXp
XXpXXpXXI
'' );(
);()|(
wuw
uwmi XXI
XXIuwp
29
presence/absence of word w in a document
Computation Detail
1,0 1,0 )()(
),(log),();(w uX X uw
uwuwuw XpXp
XXpXXpXXI
Xw=1 Xu=1
NXXcXXp uw
uw)1,1()1,1(
N
NXcXp w
w)1()1(
NXcXp u
u)1()1(
30
Xw Xu
D1 0 0D2 1 1D3 1 0….…DN 0 0
Exploit index to speed up computation
Sample Translation Probabilities (AP90)
q p(q|w)everest 0.079climber 0.042climb 0.0365
mountain 0.0359mount 0.033reach 0.0312
expedit 0.0314summit 0.0253whittak 0.016
peak 0.0149
p(w| “everest”)
q p(q|w)everest 0.1051climber 0.0423mount 0.0339
028 0.0308expedit 0.0303
peak 0.0155himalaya 0.01532
nepal 0.015sherpa 0.01431hillari 0.01431
Mutual Information
31
Synthetic Query
Regularizing Self-Translation Probability
• Self-translation probability can be under-estimated
An exact match would be counted less than an exact match
• Solution: Interpolation with “1.0 self-translation”
)|()|( wwpuwp tt
)|()1(
)|()1()|(
uwpuup
uwpt w = u
a= 1 basic query likelihood modela= 0 original MI estimate
uw
32
Query Likelihood and Translation Language Model
• Document ranking based on query likelihood | |
1 1
1 2
log ( | ) log ( | ) ( , ) log ( | )
, ...
Vm
i i ii i
m
p q d p q d c w q p w d
where q q q q
Document language model
dw
tt dupuwpdwp )|()|()|(
• Translation Language Model
Do you see any problem?
Further Smoothing of Translation Model for Computing Query Likelihood
• Linear interpolation (Jelinek-Mercer):
• Bayesian interpolation (Dirichlet prior):
du
mltt Cwpdupuwpdwp )|()]|()|()[1()|(
du
mltt Cwpd
dupuwpdddwp )|(
||)]|()|([
||||)|(
pml(w|d)
34
pml(w|d)
Experiment Design• MI vs. Synthetic query estimation
– Data Sets: Associated Press (AP90) and San Jose Mercury News (SJMN) + TREC topics 51-100
– Relatively small data sets in order to compare our results with Synthetic queries in [Berger& Lafferty 99].
• MI Translation model vs. Basic query likelihood – Larger Data Sets: TREC7, TREC8 (plus AP90, SJMN) – TREC topics 351-400 for TREC7 and 401-450 for TREC8
• Additional issues– Regularization of self-translation? – Influence of smoothing on translation models? – Translation model + pseudo feedback?
35
Mutual information outperforms synthetic queries in both MAP and P@10
AP90 + queries 51-100, Dirichlet Prior Smoothing
36
Syn. Query
MI
Syn. Query
MI
Upper Bound Comparison of Mutual Information and Synthetic Queries
Dirichlet Prior Smoothing
Data MAP Precision @10Mutual Info Syn. Query Mutual Info. Syn. Query
AP90 0.264* 0.25 0.381 0.357SJMN 0.197* 0.189 0.252 0.267
37
JM Smoothing
Data MAP Precision @10Mutual Info Syn. Query Mutual Info. Syn. Query
AP90 0.272* 0.251 0.423 0.404SJMN 0.2* 0.195 0.28 0.266
Mutual information translation model outperforms basic query likelihood
Data MAP Precision @10Basic QL MI Trans. Basic QL MI Trans.
AP90 0.248 0.272* 0.398 0.423SJMN 0.195 0.2* 0.266 0.28TREC7 0.183 0.187* 0.412 0.404TREC8 0.248 0.249 0.452 0.456
JM Smoothing
Data MAP Precision @10Basic QL MI Trans. Basic QL MI Trans.
AP90 0.246 0.264* 0.357 0.381SJMN 0.188 0.197* 0.252 0.267TREC7 0.165 0.172 0.354 0.362TREC8 0.236 0.244* 0.428 0.436
Dir. Prior Smoothing
38
Translation model appears to need less collection smoothing than basic QL
39
Translation model
Basic query likelihood
Translation model and pseudo feedback exploit word co-occurrences differently
Data MAP Precision @10BL PFB PFB+TM BL PFB PFB+TM
AP90 0.246 0.271 0.298 0.357 0.383 0.411SJMN 0.188 0.229 0.234 0.252 0.316 0.313TREC7 0.165 0.209 0.222 0.354 0.38 0.384TREC8 0.236 0.240 0.281 0.428 0.4 0.452
JM Smoothing
)|(
)|(log)|(qwp
tq dwpwp
Query model from pseudo FB
Smoothed Translation
Model40
Regularization of self-translation is beneficial
AP Data Set, Dirichlet Prior
41
Summary
• Statistical Translation language model are effective for bridging the vocabulary gap.
• Mutual information is more effective and more efficient than synthetic queries for estimating translation model probabilities.
• Regularization of self-translation is beneficial• Translation model outperforms basic query likelihood on
small and large collections and is more robust • Translation model and pseudo feedback exploit word co-
occurrences differently and can be combined to further improve performance
42
References• [1] A. Berger and J. Lafferty. Information Retrieval as Statistical
Translation. ACM SIGIR, pages 222–229, 1999.• [2] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. Mercer. The
mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
• [3] M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323-330, 2010.
43