Topic Models in Text Processing: IR Group Meeting, Presented by Qiaozhu Mei

Page 1:

Topic Models in Text Processing

IR Group Meeting
Presented by Qiaozhu Mei

Page 2:

Topic Models in Text Processing

• Introduction
• Basic Topic Models
– Probabilistic Latent Semantic Indexing
– Latent Dirichlet Allocation
– Correlated Topic Models

• Variations and Applications

Page 3:

Overview

• Motivation:
– Model the topics/subtopics in text collections

• Basic Assumptions:
– There are k topics in the whole collection
– Each topic is represented by a multinomial distribution over the vocabulary (a language model)
– Each document can cover multiple topics

• Applications:
– Summarizing topics
– Predicting topic coverage for documents
– Modeling the topic correlations

Page 4:

Basic Topic Models

• Unigram model

• Mixture of unigrams

• Probabilistic LSI

• LDA

• Correlated Topic Models

Page 5:

Unigram Model

• There is only one topic in the collection

• Estimation:
– Maximum likelihood estimation

• Application: language models for IR (LMIR)

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

M: number of documents
N: number of words in w
w: a document in the collection
w_n: a word in w
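
As a minimal illustration (not from the slides; plain Python with a made-up toy corpus), the unigram model can be estimated by simple counting and then used to score a document by summing log word probabilities:

```python
from collections import Counter
import math

# Toy collection: the whole collection shares a single unigram "topic".
docs = [["topic", "models", "for", "text"],
        ["text", "retrieval", "with", "language", "models"]]

# MLE: p(w) = count(w) / total number of word occurrences in the collection.
counts = Counter(w for d in docs for w in d)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

# Log-likelihood of a document: sum_n log p(w_n).
def log_likelihood(doc):
    return sum(math.log(p[w]) for w in doc)

print(log_likelihood(["text", "models"]))
```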

Page 6:

Mixture of unigrams

• There are k topics in the collection, but each document covers only one topic

• Estimation: MLE and the EM algorithm
• Application: simple document clustering

p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
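
A small sketch of this likelihood (not from the slides; numpy assumed, with made-up topic parameters rather than estimated ones): each document commits to a single topic z, so p(w) averages the per-topic unigram likelihoods weighted by p(z).

```python
import numpy as np

vocab = ["topic", "model", "text", "retrieval"]
p_z = np.array([0.5, 0.5])                      # p(z) for k = 2 topics
p_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],   # p(w | z = 0)
                        [0.1, 0.1, 0.4, 0.4]])  # p(w | z = 1)

def doc_likelihood(doc):
    """p(w) = sum_z p(z) * prod_n p(w_n | z): one topic per document."""
    ids = [vocab.index(w) for w in doc]
    per_topic = p_w_given_z[:, ids].prod(axis=1)  # prod_n p(w_n | z) for each z
    return float(np.dot(p_z, per_topic))

print(doc_likelihood(["text", "retrieval", "text"]))
```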

Page 7:

Probabilistic Latent Semantic Indexing

• (Thomas Hofmann ’99): each document is generated from more than one topic, with a set of document-specific mixture weights {p(z|d)} over k topics.

• These mixture weights are considered as fixed parameters to be estimated.

• Also known as the aspect model.
• No prior knowledge about topics is required; context and term co-occurrences are exploited.

Page 8:

PLSI (cont.)

• Assume uniform p(d)
• Parameter Estimation:
– π: {p(z|d)}; topic word distributions {p(w|z)}
– Maximize the log-likelihood using the EM algorithm

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)

d: a document
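
As a hedged sketch of the estimation step (numpy assumed; a random toy count matrix and random initialization, not code from the slides), one EM iteration computes the posterior p(z | d, w) in the E-step and re-estimates p(w|z) and p(z|d) from expected counts in the M-step:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, k = 4, 6, 2
counts = rng.integers(0, 5, size=(n_docs, n_words))     # term counts c(d, w)

p_w_z = rng.random((k, n_words))                         # p(w | z)
p_w_z /= p_w_z.sum(axis=1, keepdims=True)
p_z_d = rng.random((n_docs, k))                          # p(z | d)
p_z_d /= p_z_d.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: p(z | d, w) is proportional to p(w | z) p(z | d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]         # shape (d, z, w)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate from expected counts c(d, w) * p(z | d, w)
    expected = counts[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(p_z_d.round(2))   # document-specific mixture weights {p(z | d)}
```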

Page 9:

Problem of PLSI

• Mixture weights are document-specific, so there is no natural way to assign probability to a previously unseen document.

• The number of parameters to be estimated grows with the size of the training set, so the model overfits the data and suffers from multiple local maxima.

• Not a fully generative model of documents.

Page 10:

Latent Dirichlet Allocation

• (Blei et al. ’03) Treats the topic mixture weights as a k-parameter hidden random variable (a multinomial) and places a Dirichlet prior on the multinomial mixing weights. This is sampled once per document.

• The weights for word multinomial distributions are still considered as fixed parameters to be estimated.

• For a fuller Bayesian approach, one can place a Dirichlet prior on these word multinomial distributions to smooth the probabilities (similar to Dirichlet smoothing).

Page 11:

Dirichlet Distribution

• Bayesian view: nothing is fixed, everything is random.
• Choosing a prior for the multinomial distribution of the topic mixture weights: the Dirichlet distribution is conjugate to the multinomial distribution, which makes it the natural choice.
• A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex, and has the following probability density on this simplex:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
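
To make the prior concrete (a small numpy illustration, not from the slides): each Dirichlet draw is a point on the (k-1)-simplex, i.e. a valid set of topic mixture weights, and the parameters α control how concentrated the draws are.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([0.5, 0.5, 0.5])       # k = 3 topics; alpha < 1 favors sparse mixtures

theta = rng.dirichlet(alpha, size=5)    # 5 draws, each a point on the 2-simplex
print(theta.round(3))
print(theta.sum(axis=1))                # every row sums to 1
```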

Page 12:

The Graph Model of LDA

[Plate diagram: corpus-level parameters α and β; θ_d sampled once per document (plate over M documents); z_{dn} and w_{dn} sampled once per word (plate over N_d words).]

(2) p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

(3) p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
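
The integral over θ in p(w | α, β) has no closed form; a tiny Monte Carlo sketch (numpy assumed; toy α, β and word ids are made up) makes the quantity concrete by averaging the product of per-word sums over sampled θ:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 1.0])             # k = 2 topics
beta = np.array([[0.7, 0.2, 0.1],        # p(w | z = 0) over a 3-word vocabulary
                 [0.1, 0.2, 0.7]])       # p(w | z = 1)
doc = [0, 2, 2]                          # word ids w_1 .. w_N

# p(w | alpha, beta) is approximated by (1/S) * sum_s prod_n sum_z theta_s[z] * beta[z, w_n]
thetas = rng.dirichlet(alpha, size=10000)   # theta_s ~ Dir(alpha)
per_word = thetas @ beta[:, doc]            # (S, N): sum_z theta_s[z] * beta[z, w_n]
print(per_word.prod(axis=1).mean())         # Monte Carlo estimate of p(w | alpha, beta)
```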

Page 13:

Generating Process of LDA

• Choose N ~ Poisson(ξ)
• Choose θ ~ Dir(α)
• For each of the N words w_n:
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

β is a k × V matrix parameterized with the word probabilities: β_{ij} = p(w^j = 1 | z^i = 1).
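
A minimal sketch of this generating process (numpy assumed; the vocabulary, α and β are made-up toy values): sample θ ~ Dir(α) once for the document, then sample a topic and a word for each of its positions.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["topic", "model", "text", "retrieval"]
alpha = np.array([0.8, 0.8])                  # Dirichlet parameters for k = 2 topics
beta = np.array([[0.5, 0.4, 0.05, 0.05],      # beta[i, j] = p(word j | topic i)
                 [0.05, 0.05, 0.5, 0.4]])

def generate_document(n_words):
    theta = rng.dirichlet(alpha)              # theta ~ Dir(alpha), once per document
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)   # z_n ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z]) # w_n ~ Multinomial(beta[z_n])
        topics.append(int(z))
        words.append(vocab[w])
    return theta, topics, words

print(generate_document(6))
```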

Page 14:

Inference and Parameter Estimation

• Parameters:
– Corpus level: α, β: sampled once when generating the corpus
– Document level: θ, z: sampled once to generate a document

• Inference: estimation of document-level parameters.

• However, this is intractable to compute exactly, so approximate inference is needed.
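
As a practical, hedged illustration (assuming the gensim library is available; the toy texts are made up), gensim's LdaModel performs the approximate inference with online variational Bayes and returns both the topic word distributions and the per-document topic mixtures:

```python
from gensim import corpora, models

texts = [["topic", "model", "text", "mining"],
         ["language", "model", "retrieval", "text"],
         ["latent", "dirichlet", "allocation", "topic"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Approximate inference: corpus-level topics and document-level mixtures.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())                   # topic word distributions
print(lda.get_document_topics(corpus[0]))   # estimated topic mixture for the first document
```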

Page 15:

Variational Inference