CS 572: Information Retrieval
Lecture 11: Topic Models
Acknowledgments: Some slides were adapted from Chris Manning, and
from Thomas Hoffman
2/17/2016 CS572: Information Retrieval. Spring 2016 1
Plan for next few weeks
• Project 1: done (submit by Friday).
• Project 2: (topic) language models: TBA tomorrow
• Monday 2/22: no class. Watch LDA Google talk by David Blei: https://www.youtube.com/watch?v=7BMsuyBPx90
• Wednesday 2/24: guest lecture: Prof. Joyce Ho
• Monday 2/29: Semantics (conclusion); NLP for IR
• Wednesday 3/2: NLP for IR + guest lecture
• Wednesday 3/2: Midterm (take-home) assigned. Due by 5pm Thursday 3/3.
Recall: Term-document matrix
Today: Can we transform this matrix to identify the "meaning" or topic of the documents, and use that for retrieval, classification, etc.?
            Anthony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar   Tempest
anthony        5.25       3.18     0.0      0.0     0.0      0.35
brutus         1.21       6.10     0.0      1.0     0.0      0.0
caesar         8.59       2.54     0.0      1.51    0.25     0.0
calpurnia      0.0        1.54     0.0      0.0     0.0      0.0
cleopatra      2.85       0.0      0.0      0.0     0.0      0.0
mercy          1.51       0.0      1.90     0.12    5.25     0.88
worser         1.37       0.0      0.11     4.15    0.25     1.95
. . .
Problems with Lexical Semantics
• Ambiguity and association in natural language
– Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
– The word-based retrieval model is unable to discriminate between different meanings of the same word.
Problems with Lexical Semantics
– Synonymy: Different terms may have identical or similar meanings (weaker: words indicating the same topic).
– No associations between words are made in the matrix or vector space representation.
Polysemy and Context
• Document similarity on single word level: polysemy and context
[Figure: two sense clusters for the word "saturn" — meaning 1 (planet: ring, jupiter, space, voyager, planet, …) and meaning 2 (car: company, dodge, ford, …). "saturn" contributes to document similarity if both documents use it in the 1st meaning, but not if one uses it in the 2nd.]
Solution: Topic Models
• Idea: model words in context (e.g., document)
• Examples:
– Topic models in science:
http://topics.cs.princeton.edu/Science/browser/
– Topic models in javascript (by David Mimno)
http://mimno.infosci.cornell.edu/jsLDA/
Application: Model Evolution of Topics
Progression of Topic Models
• Latent Semantic Analysis / Indexing (LSA / LSI)
• Probabilistic LSI (pLSI)
• Probabilistic LSI with Dirichlet priors (LDA)
  – Mon 2/22: Google tech talk by David Blei
• Scalable topic models (SVD/NMF, Bayesian MF)
  – Wed 2/24: Prof. Joyce Ho
• Word2Vec, other extensions (Mon 2/29)
Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of document-term matrix (typical rank 100-300)
• General idea
– Map documents (and terms) to a low-dimensional representation.
– Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in this latent semantic space
Goals of LSI
• Similar terms map to similar location in low dimensional space
• Noise reduction by dimension reduction
Latent Semantic Analysis
• Latent semantic space: illustrative example
courtesy of Susan Dumais
Latent semantic indexing: Overview
• Decompose the term-document matrix C into a product of matrices, using the singular value decomposition (SVD): C = U Σ V^T.
• Then use the SVD to compute a new, improved term-document matrix C′.
• Hope: get better similarity values out of C′ (compared to C).
• Using SVD for this purpose is called latent semantic indexing (LSI).
Singular Value Decomposition

For an M × N matrix A of rank r there exists a factorization (singular value decomposition = SVD):

  A = U Σ V^T

where U is M × M, Σ is M × N, and V is N × N.
• The columns of U are orthogonal eigenvectors of A A^T.
• The columns of V are orthogonal eigenvectors of A^T A.
• Σ = diag(σ_1, …, σ_r), the singular values, with σ_i = √λ_i.
• The eigenvalues λ_1 … λ_r of A A^T are also the eigenvalues of A^T A.
Singular Value Decomposition
• Illustration of SVD dimensions and sparseness
SVD example

Let

  A =
    1  -1
    0   1
    1   0

with M = 3, N = 2. Its SVD A = U Σ V^T is

  U =
    0      2/√6
    1/√2  -1/√6
    1/√2   1/√6

  Σ =
    1  0
    0  √3

  V^T =
    1/√2   1/√2
    1/√2  -1/√2

Typically, the singular values are arranged in decreasing order.
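As a sanity check, the factorization can be reproduced numerically (a sketch using numpy; numpy returns the singular values in decreasing order, so the factors come out column-permuted and possibly sign-flipped relative to the hand computation above):

```python
import numpy as np

# The 3x2 example matrix from the slide (M=3, N=2, rank 2).
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

# Reduced-shape SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)  # singular values sqrt(3) and 1, in decreasing order
# Multiplying the factors back together recovers A (up to float error).
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```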
Low-rank Approximation

Recall A = U Σ V^T, where U is M × M, Σ is M × N, and V is N × N.
• SVD can be used to compute optimal low-rank approximations.
• Approximation problem: find a matrix A_k of rank k such that

  A_k = argmin_{X : rank(X) = k} ‖A − X‖_F   (Frobenius norm)

A_k and X are both M × N matrices. Typically, we want k << r.
Low-rank Approximation
• Solution via SVD: set the smallest r − k singular values to zero:

  A_k = U diag(σ_1, …, σ_k, 0, …, 0) V^T

• In column notation, this is a sum of k rank-1 matrices:

  A_k = ∑_{i=1}^{k} σ_i u_i v_i^T
Reduced SVD
• If we retain only k singular values and set the rest to 0, then we don't need the corresponding columns of U and rows of V^T.
• Then Σ is k×k, U is M×k, V^T is k×N, and A_k is M×N.
• This is referred to as the reduced SVD.
  – It is the convenient (space-saving) and usual form for computational applications.
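A minimal numpy sketch of the reduced SVD: keep only the k largest singular triplets and multiply them back together (the matrix here is random, purely for illustration):

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A (Eckart-Young), via the reduced SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and the matching singular vectors:
    # U_k is M x k, diag(s_k) is k x k, Vt_k is k x N, so A_k is M x N.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
A2 = low_rank_approx(A, 2)
print(np.linalg.matrix_rank(A2))  # 2
```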
Approximation error
• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

  min_{X : rank(X) = k} ‖A − X‖_F = ‖A − A_k‖_F = √(σ_{k+1}² + … + σ_r²)

where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
  – Suggests why the Frobenius error drops as k increases.
SVD Low-rank approximation
• Whereas the term-doc matrix A may have M=50000, N=10 million (and rank close to 50000)
• We can construct an approximation A100 with rank 100.
– Of all rank 100 matrices, it would have the lowest Frobenius error.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank.
Psychometrika, 1, 211-218, 1936.
Connection to Vector Space Model
• Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.
Intuition from block matrices

[Figure: an m-terms × n-documents matrix whose non-zero entries form k diagonal blocks (Block 1 … Block k); all entries outside the blocks are 0.]

What's the rank of this matrix?
Intuition from block matrices

[Figure: the same block-diagonal term-document matrix.]

Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
Intuition from block matrices

[Figure: the same block-diagonal term-document matrix.]

What's the best rank-k approximation to this matrix?
Intuition from block matrices

[Figure: the block structure now has a few nonzero entries outside the diagonal blocks — e.g., a car document (car, automobile) also containing the terms wiper, tire, V6.]

Likely there's a good rank-k approximation to this matrix.
Assumption/Hope

[Figure: the latent dimensions group the terms and documents into Topic 1, Topic 2, Topic 3.]
Latent Semantic Indexing by SVD
Performing the maps
• Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
• Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
• A query q is also mapped into this space, by

  q_k = Σ_k^{-1} U_k^T q

  – Note: the mapped query is NOT a sparse vector.
Performing the maps
• A^T A is the matrix of dot products of pairs of documents, and A^T A ≈ A_k^T A_k:

  A_k^T A_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T)
            = V_k Σ_k U_k^T U_k Σ_k V_k^T
            = (V_k Σ_k) (V_k Σ_k)^T

• Since V_k = A_k^T U_k Σ_k^{-1}, we should transform a query q to q_k as follows:

  q_k = q^T U_k Σ_k^{-1}

(Sec. 18.4)
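Putting the two map slides together, a toy LSI retrieval sketch in numpy: documents live at the rows of V_k Σ_k, a query is folded in with q_k = Σ_k^{-1} U_k^T q, and documents are ranked by cosine similarity in the latent space (the matrix, k, and query below are illustrative, not from the experiments that follow):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0 and 2 use terms 0-1; docs 1 and 3 use terms 2-3.
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 2., 0., 2.]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # reduced SVD factors

docs_k = Vk * sk            # document coordinates: rows of V_k Sigma_k

def fold_in(q):
    """Map a (sparse) term-space query into the k-dim latent space."""
    return (q @ Uk) / sk    # q_k = Sigma_k^{-1} U_k^T q

q = np.array([1., 1., 0., 0.])   # query using the first two terms
qk = fold_in(q)
scores = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
print(scores.argsort()[::-1])    # docs ranked by latent-space cosine
```

With this separable matrix, the query on terms 0-1 ranks docs 0 and 2 first, as the block-matrix intuition predicts.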
Empirical evidence
• Experiments on TREC 1/2/3 – Dumais
• Lanczos SVD code (available on netlib), due to Berry, was used in these experiments
  – Running times of ~ one day on tens of thousands of docs [still an obstacle to use]
• Dimensions – various values 250-350 reported. Reducing k improves recall.
– (Under 200 reported unsatisfactory)
• Generally expect recall to improve – what about precision?
Empirical evidence: Conclusion
• Precision at or above median TREC precision
– Top scorer on almost 20% of TREC topics
• Slightly better on average than straight vector spaces
• Effect of dimensionality:
Dimensions Precision
250 0.367
300 0.371
346 0.374
Failure modes
• Negated phrases
  – TREC topics sometimes negate certain query terms/phrases — a problem for automatic conversion of topics to queries
• Boolean queries
  – As usual, the freetext/vector space syntax of LSI queries precludes (say) "Find any doc having to do with the following 5 companies"
• See Berry & Dumais for more (resources slide)
LSI has many other applications
• In many settings in pattern recognition and retrieval, we have a feature-object matrix.
– For text, the terms are features and the docs are objects.
Could also be opinions & users … recommender systems
– This matrix may be redundant in dimensionality.
– Can work with low-rank approximation.
– If entries are missing (e.g., users’ opinions), can recover if dimensionality is low.
Resources
• http://www.cs.utk.edu/~berry/lsi++/
• http://lsi.argreenhouse.com/lsi/LSIpapers.html
• Dumais (1993) LSI meets TREC: A status report.
• Dumais (1994) Latent Semantic Indexing (LSI) and TREC-2.
• Dumais (1995) Using LSI for information filtering: TREC-3 experiments.
• M. Berry, S. Dumais and G. O'Brien. Using linear algebra for
intelligent information retrieval. SIAM Review, 37(4):573--595, 1995.
Probabilistic View: Topic Language Models

[Figure: an information need generates a query; each document d_1 … d_n in the collection has a language model M_d1 … M_dn, alongside a collection model M_C and topic models M_T1 … M_Tm. Documents are ranked by the query generation probability P(Q | M_C, M_T, M_d), combining collection, topic, and document models (cf. the simpler P(Q | M_C, M_T)).]
Latent Aspects: Example
(probabilistic) LSI: pLSI
Aspect Model
• Generation process:
  – Choose a doc d with prob P(d); there are N d's
  – Choose a latent class z with prob P(z|d); there are K z's, and K << N
  – Generate a word w with prob P(w|z)
  – This creates the pair (d, w), without direct concern for z
• Joining the probabilities:

  P(d, w) = P(d) ∑_z P(z|d) P(w|z)

• Remember: P(z|d) means "probability of z, given d"
• K is chosen in advance (how many topics in the collection?)
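The generative story above can be written out directly; a sketch with randomly initialized distributions (all dimensions and values are illustrative):

```python
import numpy as np

N, K, W = 3, 2, 4                      # docs, latent topics, vocabulary size
rng = np.random.default_rng(1)

P_d = np.full(N, 1.0 / N)              # P(d): choose a doc uniformly
P_z_d = rng.dirichlet(np.ones(K), N)   # P(z|d): N x K, rows sum to 1
P_w_z = rng.dirichlet(np.ones(W), K)   # P(w|z): K x W, rows sum to 1

# Joint over (d, w): P(d, w) = P(d) * sum_z P(z|d) P(w|z).
# The matrix product P_z_d @ P_w_z marginalizes out z, giving P(w|d).
P_dw = P_d[:, None] * (P_z_d @ P_w_z)

print(P_dw.sum())  # the joint sums to 1 over all (d, w) pairs
```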
Aspect Model (2)
• Log-likelihood; maximize this to find P(d), P(z|d), P(w|z):

  L = ∑_d ∑_w n(d, w) log P(d, w)

where n(d, w) is the count of word w in doc d.
• Apply Bayes' theorem: end up with the equivalent symmetric parameterization P(d, w) = ∑_z P(z) P(d|z) P(w|z)
• What is modeled?
  – Doc-specific word distributions, P(w|d), are based on a combination of specific classes/factors/aspects, P(w|z)
  – A doc is not just assigned to its nearest cluster
pLSI Learning
pLSI Generative Model
Approach: Expectation Maximization (EM)
• EM is a popular technique for maximum likelihood estimation with latent variables
• Alternates between:
  – E-step: calculate posterior probabilities of z based on the current parameter estimates
  – M-step: update the parameter estimates based on the calculated probabilities
Simple example

[Figure: one-dimensional data points plotted on a number line from −4 to 5.]

• Data: observations x
• Model: P(x|θ), a mixture of Gaussians
• OBJECTIVE: fit a mixture-of-Gaussians model with C = 2 components
• Some parameters are kept fixed; i.e., only a subset of θ is estimated
Likelihood function
• The likelihood is a function of the parameters θ
• Probability is a function of the random variable x (DIFFERENT from the previous plot)
Probabilistic model
• Imagine the model generating the data
• Need to introduce a label z for each data point
• The label is called a latent variable (also called hidden, unobserved, or missing)
• Simplifies the problem: if we knew the labels, we could decouple the components and estimate the parameters separately for each one
Intuition of EM
• E-step: compute a distribution on the labels of the points, using the current parameters
• M-step: update the parameters using the current guess of the label distribution
• Alternate E and M steps until convergence
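A minimal EM loop for the C = 2 Gaussian example, assuming (for illustration) unit variances and equal mixing weights held fixed, so that only the two means are estimated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two unit-variance Gaussians with means -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu = np.array([-1.0, 1.0])             # initial guesses for the two means
for _ in range(50):
    # E-step: posterior P(z=c | x_i) under the current means (equal priors,
    # unit variances, so the normalizing constants cancel).
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean as a responsibility-weighted average.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu.round(2))  # close to the true means (-2, 3)
```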
EM for pLSI
pLSA for IR: T. Hoffman 2000
• MED – 1033 docs
• CRAN – 1400 docs
• CACM – 3204 docs
• CISI – 1460 docs
• Best results reported with K varying over 32, 48, 64, 80, 128
• The pLSA* model takes the average across all models at the different K values
Example of topics found from a collection of Science Magazine papers
Using Aspects for Query Expansion
Relevance Results
• Cosine similarity is the baseline
• In LSI, the query vector q is mapped into the reduced space
• In pLSI, use p(z|d) and p(z|q); in the EM iterations, only P(z|q) is adapted
Precision-Recall results
Experiment: pLSI w/ 128-factor decomposition
Extension: Document Priors
• Model the document “prior”
• LDA: extension of pLSI to better model document generation process [David Blei]
• Video: https://www.youtube.com/watch?v=7BMsuyBPx90
• Lecture slides: http://www.cs.columbia.edu/~blei/talks/Blei_MLSS_2012.pdf
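As a preview of the LDA extension, the key difference from pLSI — drawing each document's topic proportions from a Dirichlet prior rather than estimating a free P(z|d) per training document — can be sketched generatively (dimensions and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, doc_len = 3, 8, 20
alpha, beta = 0.5, 0.1                    # Dirichlet hyperparameters

# Topic-word distributions, shared across the corpus (K x W).
phi = rng.dirichlet(np.full(W, beta), K)

def generate_doc():
    # Per-document topic mixture drawn from the Dirichlet prior --
    # this is the document "prior" that pLSI lacks.
    theta = rng.dirichlet(np.full(K, alpha))
    z = rng.choice(K, size=doc_len, p=theta)   # a topic per token
    # Each word is drawn from its token's topic-word distribution.
    return np.array([rng.choice(W, p=phi[zi]) for zi in z])

doc = generate_doc()   # a document as a sequence of word ids
```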