CS 572: Information Retrieval
Lecture 11: Topic Models
Acknowledgments: Some slides were adapted from Chris Manning, and
from Thomas Hoffman
2/17/2016 CS572: Information Retrieval. Spring 2016 1
Plan for next few weeks
• Project 1: done (submit by Friday).
• Project 2: (topic) language models: TBA tomorrow
• Monday 2/22: no class. Watch LDA Google talk by David Blei: https://www.youtube.com/watch?v=7BMsuyBPx90
• Wednesday 2/24: guest lecture: Prof. Joyce Ho
• Monday 2/29: Semantics (conclusion); NLP for IR
• Wednesday 3/2: NLP for IR + guest lecture
• Wednesday 3/2: Midterm (take-home) assigned. Due by 5pm Thursday 3/3.
Recall: Term-document matrix
Today: Can we transform this matrix to identify the "meaning" or topic of the documents, and use that for retrieval, classification, etc.?
            Anthony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar   Tempest
anthony        5.25       3.18     0.0      0.0     0.0      0.35
brutus         1.21       6.10     0.0      1.0     0.0      0.0
caesar         8.59       2.54     0.0      1.51    0.25     0.0
calpurnia      0.0        1.54     0.0      0.0     0.0      0.0
cleopatra      2.85       0.0      0.0      0.0     0.0      0.0
mercy          1.51       0.0      1.90     0.12    5.25     0.88
worser         1.37       0.0      0.11     4.15    0.25     1.95
. . .
Problems with Lexical Semantics
• Ambiguity and association in natural language
– Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
– The word-based retrieval model is unable to discriminate between different meanings of the same word.
Problems with Lexical Semantics
– Synonymy: Different terms may have identical or similar meanings (weaker: words indicating the same topic).
– No associations between words are made in the matrix or vector space representation.
Polysemy and Context
• Document similarity on single word level: polysemy and context
[Figure: two sense clusters for the word "saturn" — meaning 1 (planet: ring, jupiter, space, voyager, planet, …) and meaning 2 (car: company, dodge, ford, …). "saturn" contributes to document similarity if both documents use it in the 1st meaning, but not if one uses it in the 2nd.]
Solution: Topic Models
• Idea: model words in context (e.g., document)
• Examples:
– Topic models in science:
http://topics.cs.princeton.edu/Science/browser/
– Topic models in javascript (by David Mimno)
http://mimno.infosci.cornell.edu/jsLDA/
Application: Model Evolution of Topics
Progression of Topic Models
• Latent Semantic Analysis / Indexing (LSA / LSI)
• Probabilistic LSI (pLSI)
• Probabilistic LSI with Dirichlet priors (LDA)
  – Mon 2/22: Google tech talk by David Blei
• Scalable topic models (SVD/NMF, Bayesian MF)
  – Wed 2/24: Prof. Joyce Ho
• Word2Vec, other extensions (Mon 2/29)
Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of document-term matrix (typical rank 100-300)
• General idea
– Map documents (and terms) to a low-dimensional representation.
– Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in this latent semantic space
Goals of LSI
• Similar terms map to similar location in low dimensional space
• Noise reduction by dimension reduction
Latent Semantic Analysis
• Latent semantic space: illustrative example
courtesy of Susan Dumais
Latent semantic indexing: Overview
• Decompose the term-document matrix C into a product of matrices, using the singular value decomposition (SVD): C = U Σ V^T.
• Then use the SVD to compute a new, improved term-document matrix C′.
• Hope: get better similarity values out of C′ (compared to C).
• Using SVD for this purpose is called latent semantic indexing (LSI).
Singular Value Decomposition

For an M × N matrix A of rank r there exists a factorization (singular value decomposition = SVD):

  A = U Σ V^T

where U is M × M, Σ is M × N, and V is N × N.
• The columns of U are orthogonal eigenvectors of A A^T.
• The columns of V are orthogonal eigenvectors of A^T A.
• Σ = diag(σ_1, …, σ_r), the singular values, with σ_i = √λ_i.
• The eigenvalues λ_1 … λ_r of A A^T are also the eigenvalues of A^T A.
Singular Value Decomposition
• Illustration of SVD dimensions and sparseness
SVD example

Let

  A =
    1  -1
    0   1
    1   0

with M = 3, N = 2. Its SVD A = U Σ V^T is

  U =
    0      2/√6
    1/√2  -1/√6
    1/√2   1/√6

  Σ =
    1  0
    0  √3

  V^T =
    1/√2   1/√2
    1/√2  -1/√2

Typically, the singular values are arranged in decreasing order.
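As a sanity check, the factorization can be reproduced numerically (a sketch using numpy; numpy returns the singular values in decreasing order, so the factors come out column-permuted and possibly sign-flipped relative to the hand computation above):

```python
import numpy as np

# The 3x2 example matrix from the slide (M=3, N=2, rank 2).
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

# Reduced-shape SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)  # singular values sqrt(3) and 1, in decreasing order
# Multiplying the factors back together recovers A (up to float error).
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```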
Low-rank Approximation

Recall A = U Σ V^T, where U is M × M, Σ is M × N, and V is N × N.
• SVD can be used to compute optimal low-rank approximations.
• Approximation problem: find a matrix A_k of rank k such that

  A_k = argmin_{X : rank(X) = k} ‖A − X‖_F   (Frobenius norm)

A_k and X are both M × N matrices. Typically, we want k << r.
Low-rank Approximation
• Solution via SVD: set the smallest r − k singular values to zero:

  A_k = U diag(σ_1, …, σ_k, 0, …, 0) V^T

• In column notation, this is a sum of k rank-1 matrices:

  A_k = ∑_{i=1}^{k} σ_i u_i v_i^T
Reduced SVD
• If we retain only k singular values and set the rest to 0, then we don't need the corresponding columns of U and rows of V^T.
• Then Σ is k×k, U is M×k, V^T is k×N, and A_k is M×N.
• This is referred to as the reduced SVD.
  – It is the convenient (space-saving) and usual form for computational applications.
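A minimal numpy sketch of the reduced SVD: keep only the k largest singular triplets and multiply them back together (the matrix here is random, purely for illustration):

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A (Eckart-Young), via the reduced SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and the matching singular vectors:
    # U_k is M x k, diag(s_k) is k x k, Vt_k is k x N, so A_k is M x N.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
A2 = low_rank_approx(A, 2)
print(np.linalg.matrix_rank(A2))  # 2
```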
Approximation error
• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

  min_{X : rank(X) = k} ‖A − X‖_F = ‖A − A_k‖_F = √(σ_{k+1}² + … + σ_r²)

where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
  – Suggests why the Frobenius error drops as k increases.
SVD Low-rank approximation
• Whereas the term-doc matrix A may have M=50000, N=10 million (and rank close to 50000)
• We can construct an approximation A100 with rank 100.
– Of all rank 100 matrices, it would have the lowest Frobenius error.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank.
Psychometrika, 1, 211-218, 1936.
Connection to Vector Space Model
• Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.
Intuition from block matrices

[Figure: an m-terms × n-documents matrix whose non-zero entries form k diagonal blocks (Block 1 … Block k); all entries outside the blocks are 0.]

What's the rank of this matrix?
Intuition from block matrices

[Figure: the same block-diagonal term-document matrix.]

Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
Intuition from block matrices

[Figure: the same block-diagonal term-document matrix.]

What's the best rank-k approximation to this matrix?
Intuition from block matrices

[Figure: the block structure now has a few nonzero entries outside the diagonal blocks — e.g., a car document (car, automobile) also containing the terms wiper, tire, V6.]

Likely there's a good rank-k approximation to this matrix.
Assumption/Hope

[Figure: the latent dimensions group the terms and documents into Topic 1, Topic 2, Topic 3.]
Latent Semantic Indexing by SVD
Performing the maps
• Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
• Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
• A query q is also mapped into this space, by

  q_k = Σ_k^{-1} U_k^T q

  – Note: the mapped query is NOT a sparse vector.
Performing the maps
• A^T A is the matrix of dot products of pairs of documents, and A^T A ≈ A_k^T A_k:

  A_k^T A_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T)
            = V_k Σ_k U_k^T U_k Σ_k V_k^T
            = (V_k Σ_k) (V_k Σ_k)^T

• Since V_k = A_k^T U_k Σ_k^{-1}, we should transform a query q to q_k as follows:

  q_k = q^T U_k Σ_k^{-1}

(Sec. 18.4)
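Putting the two map slides together, a toy LSI retrieval sketch in numpy: documents live at the rows of V_k Σ_k, a query is folded in with q_k = Σ_k^{-1} U_k^T q, and documents are ranked by cosine similarity in the latent space (the matrix, k, and query below are illustrative, not from the experiments that follow):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0 and 2 use terms 0-1; docs 1 and 3 use terms 2-3.
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 2., 0., 2.]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # reduced SVD factors

docs_k = Vk * sk            # document coordinates: rows of V_k Sigma_k

def fold_in(q):
    """Map a (sparse) term-space query into the k-dim latent space."""
    return (q @ Uk) / sk    # q_k = Sigma_k^{-1} U_k^T q

q = np.array([1., 1., 0., 0.])   # query using the first two terms
qk = fold_in(q)
scores = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
print(scores.argsort()[::-1])    # docs ranked by latent-space cosine
```

With this separable matrix, the query on terms 0-1 ranks docs 0 and 2 first, as the block-matrix intuition predicts.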
Empirical evidence
• Experiments on TREC 1/2/3 – Dumais
• Lanczos SVD code (available on netlib), due to Berry, was used in these experiments
  – Running times of ~ one day on tens of thousands of docs [still an obstacle to use]
• Dimensions – various values 250-350 reported. Reducing k improves recall.
– (Under 200 reported unsatisfactory)
• Generally expect recall to improve – what about precision?
Empirical evidence: Conclusion
• Precision at or above median TREC precision
– Top scorer on almost 20% of TREC topics
• Slightly better on average than straight vector spaces
• Effect of dimensionality:
Dimensions Precision
250 0.367
300 0.371
346 0.374
Failure modes
• Negated phrases
  – TREC topics sometimes negate certain query terms/phrases — a problem for automatic conversion of topics to queries
• Boolean queries
  – As usual, the freetext/vector space syntax of LSI queries precludes (say) "Find any doc having to do with the following 5 companies"
• See Berry & Dumais for more (resources slide)
LSI has many other applications
• In many settings in pattern recognition and retrieval, we have a feature-object matrix.
– For text, the terms are features and the docs are objects.
Could also be opinions & users … recommender systems
– This matrix may be redundant in dimensionality.
– Can work with low-rank approximation.
– If entries are missing (e.g., users’ opinions), can recover if dimensionality is low.
Resources
• http://www.cs.utk.edu/~berry/lsi++/
• http://lsi.argreenhouse.com/lsi/LSIpapers.html
• Dumais (1993) LSI meets TREC: A status report.
• Dumais (1994) Latent Semantic Indexing (LSI) and TREC-2.
• Dumais (1995) Using LSI for information filtering: TREC-3 experiments.
• M. Berry, S. Dumais and G. O'Brien. Using linear algebra for
intelligent information retrieval. SIAM Review, 37(4):573--595, 1995.
Probabilistic View: Topic Language Models

[Figure: an information need generates a query; each document d_1 … d_n in the collection has a language model M_d1 … M_dn, alongside a collection model M_C and topic models M_T1 … M_Tm. Documents are ranked by the query generation probability P(Q | M_C, M_T, M_d), combining collection, topic, and document models (cf. the simpler P(Q | M_C, M_T)).]
Latent Aspects: Example
(probabilistic) LSI: pLSI
Aspect Model
• Generation process:
  – Choose a doc d with prob P(d); there are N d's
  – Choose a latent class z with prob P(z|d); there are K z's, and K << N
  – Generate a word w with prob P(w|z)
  – This creates the pair (d, w), without direct concern for z
• Joining the probabilities:

  P(d, w) = P(d) ∑_z P(z|d) P(w|z)

• Remember: P(z|d) means "probability of z, given d"
• K is chosen in advance (how many topics in the collection?)
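The generative story above can be written out directly; a sketch with randomly initialized distributions (all dimensions and values are illustrative):

```python
import numpy as np

N, K, W = 3, 2, 4                      # docs, latent topics, vocabulary size
rng = np.random.default_rng(1)

P_d = np.full(N, 1.0 / N)              # P(d): choose a doc uniformly
P_z_d = rng.dirichlet(np.ones(K), N)   # P(z|d): N x K, rows sum to 1
P_w_z = rng.dirichlet(np.ones(W), K)   # P(w|z): K x W, rows sum to 1

# Joint over (d, w): P(d, w) = P(d) * sum_z P(z|d) P(w|z).
# The matrix product P_z_d @ P_w_z marginalizes out z, giving P(w|d).
P_dw = P_d[:, None] * (P_z_d @ P_w_z)

print(P_dw.sum())  # the joint sums to 1 over all (d, w) pairs
```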
Aspect Model (2)
• Log-likelihood; maximize this to find P(d), P(z|d), P(w|z):

  L = ∑_d ∑_w n(d, w) log P(d, w)

where n(d, w) is the count of word w in doc d.
• Apply Bayes' theorem: end up with the equivalent symmetric parameterization P(d, w) = ∑_z P(z) P(d|z) P(w|z)
• What is modeled?
  – Doc-specific word distributions, P(w|d), are based on a combination of specific classes/factors/aspects, P(w|z)
  – A doc is not just assigned to its nearest cluster
pLSI Learning
pLSI Generative Model
Approach: Expectation Maximization (EM)
• EM is a popular technique for maximum likelihood estimation with latent variables
• Alternates between:
  – E-step: calculate posterior probabilities of z based on the current parameter estimates
  – M-step: update the parameter estimates based on the calculated probabilities
Simple example

[Figure: one-dimensional data points plotted on a number line from −4 to 5.]

• Data: observations x
• Model: P(x|θ), a mixture of Gaussians
• OBJECTIVE: fit a mixture-of-Gaussians model with C = 2 components
• Some parameters are kept fixed; i.e., only a subset of θ is estimated
Likelihood function
• The likelihood is a function of the parameters θ
• Probability is a function of the random variable x (DIFFERENT from the previous plot)
Probabilistic model
• Imagine the model generating the data
• Need to introduce a label z for each data point
• The label is called a latent variable (also called hidden, unobserved, or missing)
• Simplifies the problem: if we knew the labels, we could decouple the components and estimate the parameters separately for each one
Intuition of EM
• E-step: compute a distribution on the labels of the points, using the current parameters
• M-step: update the parameters using the current guess of the label distribution
• Alternate E and M steps until convergence
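A minimal EM loop for the C = 2 Gaussian example, assuming (for illustration) unit variances and equal mixing weights held fixed, so that only the two means are estimated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two unit-variance Gaussians with means -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu = np.array([-1.0, 1.0])             # initial guesses for the two means
for _ in range(50):
    # E-step: posterior P(z=c | x_i) under the current means (equal priors,
    # unit variances, so the normalizing constants cancel).
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean as a responsibility-weighted average.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu.round(2))  # close to the true means (-2, 3)
```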
EM for pLSI
pLSA for IR: T. Hoffman 2000
• MED – 1033 docs
• CRAN – 1400 docs
• CACM – 3204 docs
• CISI – 1460 docs
• Best results reported with K varying over 32, 48, 64, 80, 128
• The pLSA* model takes the average across all models at the different K values
Example of topics found from a collection of Science Magazine papers
Using Aspects for Query Expansion
Relevance Results
• Cosine similarity is the baseline
• In LSI, the query vector q is mapped into the reduced space
• In pLSI, use p(z|d) and p(z|q); in the EM iterations, only P(z|q) is adapted
Precision-Recall results
Experiment: pLSI w/ 128-factor decomposition
Extension: Document Priors
• Model the document “prior”
• LDA: extension of pLSI to better model document generation process [David Blei]
• Video: https://www.youtube.com/watch?v=7BMsuyBPx90
• Lecture slides: http://www.cs.columbia.edu/~blei/talks/Blei_MLSS_2012.pdf
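As a preview of the LDA extension, the key difference from pLSI — drawing each document's topic proportions from a Dirichlet prior rather than estimating a free P(z|d) per training document — can be sketched generatively (dimensions and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, doc_len = 3, 8, 20
alpha, beta = 0.5, 0.1                    # Dirichlet hyperparameters

# Topic-word distributions, shared across the corpus (K x W).
phi = rng.dirichlet(np.full(W, beta), K)

def generate_doc():
    # Per-document topic mixture drawn from the Dirichlet prior --
    # this is the document "prior" that pLSI lacks.
    theta = rng.dirichlet(np.full(K, alpha))
    z = rng.choice(K, size=doc_len, p=theta)   # a topic per token
    # Each word is drawn from its token's topic-word distribution.
    return np.array([rng.choice(W, p=phi[zi]) for zi in z])

doc = generate_doc()   # a document as a sequence of word ids
```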