Language Models
![Page 1: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/1.jpg)
Language Models
Naama Kraus
Slides are based on the book Introduction to Information Retrieval by Manning, Raghavan and Schütze
![Page 2: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/2.jpg)
IR approaches
• Boolean retrieval
– Boolean constraints on term occurrences in documents
– No ranking
• Vector space model
– Queries and documents are represented as vectors in a high-dimensional space
– Notions of similarity (e.g. cosine similarity) imply a ranking
• Probabilistic model
– Rank documents by the probability P(R|d,q)
– Estimate P(R|d,q) using relevance feedback techniques
• Language models – today’s class
![Page 3: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/3.jpg)
Intuition
• Users who try to think of a good query think of words that are likely to appear in relevant documents
• Language model approach:
– A document is a good match to a query if the document model is likely to generate the query
– This happens if the document contains the query words often
![Page 4: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/4.jpg)
Illustration
[Figure: a language model is derived from the document and is used to generate the query]
![Page 5: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/5.jpg)
Traditional language model
• Finite automata
• Generative model
[Figure: an automaton that emits "I wish" and can repeat]
Strings it can generate:
I wish
I wish I wish
I wish I wish I wish
…
The language of the automaton: the full set of strings that it can generate
![Page 6: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/6.jpg)
Probabilistic language model
• Each node has a probability distribution over generating different terms
• A language model is a function that puts a probability measure over strings drawn from some vocabulary
![Page 7: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/7.jpg)
Language model example
Unigram language model with a single state s. State emission probabilities (partial):
the 0.2
a 0.1
frog 0.01
toad 0.01
said 0.03
likes 0.02
that 0.04
…
STOP 0.2
Probability that some text (e.g. a query) was generated by the model:
P(frog said that toad likes frog) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.01 = 2.4 x 10^-11
(We ignore continue/stop probabilities, assuming they are fixed for all queries)
![Page 8: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/8.jpg)
Query likelihood
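String s and the probability that each model assigns to each of its terms:

| s  | frog   | said | that | toad   | likes | that | dog   |
|----|--------|------|------|--------|-------|------|-------|
| M1 | 0.01   | 0.03 | 0.04 | 0.01   | 0.02  | 0.04 | 0.005 |
| M2 | 0.0002 | 0.03 | 0.04 | 0.0001 | 0.04  | 0.04 | 0.01  |

q = frog likes toad
P(q | M1) = 0.01 x 0.02 x 0.01 = 2 x 10^-6
P(q | M2) = 0.0002 x 0.04 x 0.0001 = 8 x 10^-10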
P(q|M1) > P(q|M2)
=> M1 is more likely to generate query q
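The comparison above can be checked with a few lines of Python. This is a minimal sketch: the dictionaries `M1` and `M2` hold only the per-term probabilities listed on the slide, and the function name `query_likelihood` is chosen here for illustration. As on the slide, stop probabilities are ignored.

```python
from math import prod

# Per-term probabilities copied from the slide (both models are partial).
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001, "likes": 0.04, "dog": 0.01}

def query_likelihood(query, model):
    """Unigram likelihood: product of the model's probability for each query term."""
    return prod(model[term] for term in query.split())

q = "frog likes toad"
p1 = query_likelihood(q, M1)
p2 = query_likelihood(q, M2)
print(p1 > p2)  # True when M1 is the more likely generator of q
```

A term missing from a model raises `KeyError` here; how to handle unseen terms is a separate question that this sketch does not address.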
![Page 9: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/9.jpg)
Types of language models
How do we build probabilities over sequences of terms?
Chain rule:
P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t1 t2) x P(t4|t1 t2 t3)
Unigram language model – the simplest; no conditioning context
P(t1 t2 t3 t4) = P(t1) x P(t2) x P(t3) x P(t4)
Bigram language model – condition on the previous term
P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t2) x P(t4|t3)
Trigram language model …
The unigram model is the most common in IR
• Often sufficient to judge the topic of a document
• Data sparseness issues when using richer models
• Simple and efficient implementation
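To make the difference between the two factorizations concrete, here is a small sketch that estimates both models from counts. The training text `tokens` is a made-up toy example, not from the slides, and the probabilities are plain relative frequencies, which is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy training text; not from the slides.
tokens = "frog said that toad likes frog that said toad".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def p_unigram(seq):
    """P(t1..tn) = product of P(ti); no conditioning context."""
    p = 1.0
    for t in seq:
        p *= unigram_counts[t] / total
    return p

def p_bigram(seq):
    """P(t1..tn) = P(t1) * product of P(ti | previous term), from counts."""
    p = unigram_counts[seq[0]] / total
    for prev, cur in zip(seq, seq[1:]):
        # How often `prev` occurs as a left context (followed by any term).
        context = sum(c for (a, _), c in bigram_counts.items() if a == prev)
        p *= bigram_counts[(prev, cur)] / context if context else 0.0
    return p

seq = "frog said that".split()
print(p_unigram(seq), p_bigram(seq))
# The pair "toad frog" never occurs in the training text:
print(p_unigram(["toad", "frog"]), p_bigram(["toad", "frog"]))
```

The second call illustrates the data sparseness point: both words are known, so the unigram model gives a positive probability, while the bigram model gives 0 because that particular pair was never observed.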
![Page 10: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/10.jpg)
The query likelihood model
• Goal: rank documents by P(d|q)
– The probability that a user issuing query q had document d in mind
• Bayes' rule: P(d|q) = P(q|d)P(d)/P(q)
• P(q) – the same for all documents => ignored
• P(d) – often treated as uniform across documents => ignored
– Could be a non-uniform prior based on criteria like authority, length, genre, newness …
• Rank by P(q|d)
![Page 11: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/11.jpg)
The query likelihood model (2)
• P(q|d) – the probability that query q was generated by a language model derived from document d
– The probability that the query would be observed as a random sample from the respective document model
• Algorithm:
1. Infer a LM Md for each document d
2. Estimate P(q|Md)
3. Rank the documents according to these probabilities
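The three steps can be sketched as follows. The documents in `docs` are made-up toy examples, and the sketch assumes the simplest possible estimate for step 1: relative term frequencies, without any smoothing. The function names are illustrative only.

```python
from collections import Counter
from math import prod

# Hypothetical toy documents; not from the slides.
docs = {
    "d1": "frog said that toad likes frog",
    "d2": "toad said that dog likes toad",
    "d3": "frog likes frog",
}

def infer_model(text):
    """Step 1: a unigram model Md as relative term frequencies in d."""
    tokens = text.split()
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def query_likelihood(query, model):
    """Step 2: P(q|Md) as a product over query terms (0 for unseen terms)."""
    return prod(model.get(t, 0.0) for t in query.split())

def rank(query, docs):
    """Step 3: sort document ids by decreasing P(q|Md)."""
    models = {d: infer_model(text) for d, text in docs.items()}
    scores = {d: query_likelihood(query, m) for d, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

ranking, scores = rank("frog likes", docs)
print(ranking)
```

With this unsmoothed estimate, a document that lacks even one query term (here d2, which has no "frog") scores exactly 0.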
![Page 12: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/12.jpg)
Illustration
[Figure: each document d1, d2, d3 yields a model Md1, Md2, Md3; the query is scored against each model as P(q|Md1), P(q|Md2), P(q|Md3)]
E.g., P(q|Md3) > P(q|Md1) > P(q|Md2) => d3 is first, d1 is second, d2 is third
![Page 13: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/13.jpg)
Estimating P(q|Md)
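Use Maximum Likelihood Estimation (MLE)
Assume a unigram language model (terms occur independently)
P(q|Md) = ∏ over terms t in q of P(t|Md)   (unigram)
P(t|Md) = tf(t,d)/Ld   (MLE)
• tf(t,d) = number of occurrences of term t in document d
• Ld = length of document d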
![Page 14: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/14.jpg)
Sparse data problem
• Documents are sparse
– Some words don’t appear in the document
– In particular, some of the query terms
• P(q|d) = 0; the zero-probability problem
– Conjunctive semantics: a document gets a non-zero score only if it contains all query terms
• Occurring words are poorly estimated
– A single document is a small training set
– Occurring words are overestimated
• Their occurrence was partly by chance
![Page 15: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/15.jpg)
Solution: smoothing
• Smooth probabilities in LMs
– Overcome zero probabilities
– Give some probability mass to unseen words
• The probability of a non-occurring term should be close to its probability of occurring in the collection
P(t|Mc) = cf(t)/T
• cf(t) = number of occurrences of term t in the collection
• T = length of the collection = sum of all document lengths
![Page 16: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/16.jpg)
Smoothing methods
Linear Interpolation
Bayesian smoothing
Summary, with linear interpolation
In practice, the log is taken of both sides of the equation to avoid multiplying many small numbers
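The two smoothing methods and the log trick can be sketched in Python. The formulas used here are the forms commonly given in the textbook the slides are based on, and are an assumption of this sketch: linear interpolation as lambda x P(t|Md) + (1 - lambda) x P(t|Mc), and Bayesian smoothing with a Dirichlet prior as (tf(t,d) + alpha x P(t|Mc)) / (Ld + alpha). The toy collection, the parameter values and all function names are hypothetical.

```python
from collections import Counter
from math import log

# Hypothetical toy collection; not from the slides.
docs = {
    "d1": "frog said that toad likes frog".split(),
    "d2": "toad said that dog likes toad".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
cf = Counter(collection)   # collection frequencies cf(t)
T = len(collection)        # collection length

def p_interp(t, tokens, lam=0.5):
    """Linear interpolation of the document and collection MLE estimates."""
    return lam * tokens.count(t) / len(tokens) + (1 - lam) * cf[t] / T

def p_bayes(t, tokens, alpha=2.0):
    """Bayesian (Dirichlet prior) smoothing; alpha is an assumed prior weight."""
    return (tokens.count(t) + alpha * cf[t] / T) / (len(tokens) + alpha)

def log_score(query, tokens, p):
    """Sum of log probabilities instead of a product of small numbers."""
    return sum(log(p(t, tokens)) for t in query.split())

q = "frog likes"
scores = {d: log_score(q, tokens, p_interp) for d, tokens in docs.items()}
print(scores)
```

Because the log is monotonic, ranking by the summed logs gives the same order as ranking by the product. Note that a query term absent from the whole collection still gets probability 0 under both methods, and `log` would then raise `ValueError`; the sketch does not handle that case.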
![Page 17: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/17.jpg)
Exercise
Given a collection of two documents D1, D2
D1: Xyzzy reports a profit but revenue is down
D2: Quorus narrows quarter loss but revenue decreases further
A user submitted the query: “revenue down”
Rank D1 and D2. Use an MLE unigram model and linear interpolation smoothing with lambda parameter 0.5
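One way to work through the exercise is sketched below, using exact fractions so the result is easy to check by hand. The sketch assumes whitespace tokenization and case folding (neither is specified on the slide), and it assumes the interpolated estimate lambda x P(t|Md) + (1 - lambda) x P(t|Mc); with lambda = 0.5 it does not matter which side lambda is attached to.

```python
from collections import Counter
from fractions import Fraction

D1 = "Xyzzy reports a profit but revenue is down".lower().split()
D2 = "Quorus narrows quarter loss but revenue decreases further".lower().split()
collection = D1 + D2
cf = Counter(collection)   # collection frequencies
T = len(collection)        # 16 tokens in total
LAMBDA = Fraction(1, 2)

def p_smoothed(term, doc):
    """Interpolated estimate of P(term | doc), in exact fractions."""
    p_doc = Fraction(doc.count(term), len(doc))
    p_col = Fraction(cf[term], T)
    return LAMBDA * p_doc + (1 - LAMBDA) * p_col

def score(query, doc):
    """P(q|d) as the product of the smoothed term probabilities."""
    result = Fraction(1)
    for term in query.lower().split():
        result *= p_smoothed(term, doc)
    return result

s1 = score("revenue down", D1)
s2 = score("revenue down", D2)
print(s1, s2)  # 3/256 1/256
```

Under these assumptions D1 ranks above D2: both documents contain "revenue", but only D1 contains "down", and smoothing keeps D2's score above zero.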
![Page 18: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/18.jpg)
Extended LM approaches
[Figure: the query yields a query model P(t|query) and the document yields a document model P(t|document); the two are related by query likelihood, document likelihood, or model comparison]
• Query likelihood P(q|d) – the probability that the document LM generates the query (seen in the previous slides)
• Document likelihood P(d|q) – the probability that the query LM generates the document (in the next slides)
• Model comparison R(d;q) – compare the document and query models (in the next slides)
![Page 19: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/19.jpg)
Document likelihood model
• P(d|q) – the probability that the query LM generates the document
• Problem: queries are short => bad model estimation
• [Zhai and Lafferty 2001]
– Expand the query with terms taken from relevant documents in the usual way, and hence update the language model
![Page 20: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/20.jpg)
KL divergence
• Kullback–Leibler (KL) divergence
– An asymmetric divergence measure from information theory
– Measures the difference between two probability distributions P, Q
– Typically Q is an estimate of P
Properties
• Non-negative
• Equals 0 iff P equals Q
• May have an infinite value
• Asymmetric, thus not a metric
• Jensen–Shannon (JS) divergence
– Based on KL divergence (D)
– Always finite
– 0 <= JSD <= 1 (with base-2 logarithms)
– Symmetric
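The properties listed above can be observed in a short sketch. It uses the standard definitions, D(P||Q) = sum over x of P(x) log(P(x)/Q(x)) and JSD(P,Q) = D(P||M)/2 + D(Q||M)/2 with M the average of P and Q, and base-2 logarithms so that JSD stays within [0, 1]. The example distributions `P`, `Q` and `R` are made up for illustration.

```python
from math import log2, inf

def kl(p, q):
    """D(P||Q); infinite if Q gives probability 0 to something P can produce."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # a term with P(x) = 0 contributes nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return inf
        total += px * log2(px / qx)
    return total

def js(p, q):
    """Jensen-Shannon divergence built from two KL terms against the mixture M."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example distributions.
P = {"frog": 0.5, "toad": 0.5}
Q = {"frog": 0.9, "toad": 0.1}
R = {"frog": 1.0}

print(kl(P, P))            # 0.0: identical distributions
print(kl(P, Q), kl(Q, P))  # two different values: KL is asymmetric
print(kl(P, R))            # inf: R cannot produce "toad"
print(js(P, R))            # finite, between 0 and 1
```

The mixture M is non-zero wherever P or Q is non-zero, which is why JSD stays finite even when KL does not.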
![Page 21: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/21.jpg)
Model comparison
Make LMs from both the query and the document
Measure how different these LMs are from each other
Use KL divergence
Rank by KLD: the closer to 0, the higher the rank
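A minimal sketch of ranking by model comparison follows. Several choices here are assumptions, not taken from the slides: the divergence is computed from the query model to the document model (summing P(t|Mq) log(P(t|Mq)/P(t|Md)) over query terms), the document model is smoothed by linear interpolation with the collection so that query terms never get probability 0, and the toy documents and query are made up.

```python
from collections import Counter
from math import log

# Hypothetical toy documents and query; not from the slides.
docs = {
    "d1": "frog said that toad likes frog".split(),
    "d2": "toad said that dog likes toad".split(),
}
query = "frog likes frog".split()
collection = [t for tokens in docs.values() for t in tokens]
cf, T = Counter(collection), len(collection)

def mle(tokens):
    """Unsmoothed unigram model: relative term frequencies."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def smoothed_doc_model(tokens, lam=0.5):
    """Interpolate with the collection model over the collection vocabulary."""
    return {t: lam * tokens.count(t) / len(tokens) + (1 - lam) * cf[t] / T for t in cf}

def kl(p, q):
    """Sum over terms with P(t) > 0 of P(t) * log(P(t) / Q(t))."""
    return sum(pt * log(pt / q[t]) for t, pt in p.items() if pt > 0)

query_model = mle(query)
divergence = {d: kl(query_model, smoothed_doc_model(tokens)) for d, tokens in docs.items()}
ranking = sorted(divergence, key=divergence.get)  # smallest divergence first
print(ranking, divergence)
```

A query term that never occurs in the collection would raise `KeyError` in `kl`; the sketch leaves that case unhandled.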
![Page 22: Language Models](https://reader035.fdocuments.us/reader035/viewer/2022062811/5681617c550346895dd10b7b/html5/thumbnails/22.jpg)
Language models - summary
• Probabilistic model
– Mathematically precise
• Intuitive, simple concept
• Achieves very good retrieval results
– Still, no evidence that it exceeds the traditional vector space model
• Relation to the vector space model
– Both use term frequency
– Smoothing with the collection generation probability is a little like idf
• Terms rare in the general collection but common in some documents will have a greater influence on the document’s ranking
– Probabilistic vs. geometric
– Mathematical model vs. heuristic model