Topic Modeling (600.465 Intro to NLP, J. Eisner): cs.jhu.edu/~jason/465/PDFSlides/lect31-topicmodels.pdf


600.465 - Intro to NLP - J. Eisner 1

Topic Modeling

600.465 - Intro to NLP - J. Eisner 2

Embedding Documents
A trick from Information Retrieval: each document in the corpus is a length-k vector (or each paragraph, or whatever).

(0, 3, 3, 1, 0, 7, . . . 1, 0)

That vector represents a single document. Pretty sparse, so pretty noisy! Hard to tell which docs are similar.

Wish we could smooth the vector: what would the document look like if it rambled on forever?
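A minimal sketch (not from the slides) of what such a length-k vector might look like in code, assuming the entries are word counts; the toy vocabulary, the example document, and the helper `count_vector` are all made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary of k words (in a real IR system k is large).
vocab = ["cat", "dog", "jordan", "basketball", "river", "bank", "the", "a"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def count_vector(document_tokens, word_to_index):
    """Return the length-k vector of word counts for one document."""
    v = np.zeros(len(word_to_index), dtype=int)
    for token in document_tokens:
        if token in word_to_index:      # ignore out-of-vocabulary words
            v[word_to_index[token]] += 1
    return v

doc = "the dog chased the cat across the river".split()
print(count_vector(doc, word_to_index))   # sparse: mostly zeros
```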

600.465 - Intro to NLP - J. Eisner 3

Latent Semantic Analysis
A trick from Information Retrieval: each document in the corpus is a length-k vector. Plot all documents in the corpus.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot. A clustering algorithm might discover this cluster of "similar" documents, similar thanks to smoothing!]

600.465 - Intro to NLP - J. Eisner 4

Latent Semantic Analysis
The reduced plot is a perspective drawing of the true plot: it projects the true plot onto a few axes, a best choice of axes that shows the most variation in the data. Found by linear algebra: "Principal Components Analysis" (PCA).

[Figure: true plot in k dimensions vs. reduced-dimensionality plot]

600.465 - Intro to NLP - J. Eisner 5

Latent Semantic Analysis
The SVD plot allows the best possible reconstruction of the true plot (i.e., we can recover the 3-D coordinates with minimal distortion).
It ignores variation in the axes that it didn't pick. Hope that variation is just noise and we want to ignore it.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]

600.465 - Intro to NLP - J. Eisner 6

Latent Semantic Analysis
SVD finds a small number of topic vectors. It approximates each doc as a linear combination of topics; the coordinates in the reduced plot are the linear coefficients. How much of topic A is in this document? How much of topic B? (A numpy sketch follows below.)
Each topic is a collection of words that tend to appear together.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]
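A hedged numpy sketch of the idea on this slide: run SVD on a made-up 9x7 term-document matrix, keep j topic vectors, and read off each document's coefficients. The matrix, the seed, and the variable names are all invented for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical 9-term x 7-document count matrix (rows = terms, columns = docs),
# loosely matching the 9x7 picture in the slides; the numbers are made up.
rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)

# Full SVD: M = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only j "topic" dimensions.
j = 2
topics = U[:, :j]                          # each column: a topic = weighted collection of terms
doc_coords = np.diag(s[:j]) @ Vt[:j, :]    # each column: a doc's coordinates (how much of topic A, topic B)

# Each document is approximated as a linear combination of the j topic vectors.
M_approx = topics @ doc_coords
print("reconstruction error:", np.linalg.norm(M - M_approx))
```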

600.465 - Intro to NLP - J. Eisner 7

Latent Semantic Analysis
The new coordinates might actually be useful for Info Retrieval. To compare 2 documents, or a query and a document: project both into the reduced space and ask whether they have topics in common, even if they have no words in common! (A sketch of this comparison appears below.)

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]
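A sketch of the comparison described above, reusing the same made-up 9x7 matrix and seed as before: project a query and every document into the 2-dimensional topic space and compare them there. The query's term indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)        # same toy 9x7 term-document matrix as above
U, s, Vt = np.linalg.svd(M, full_matrices=False)
topics = U[:, :2]                                      # 2 topic axes

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# A query is just another term vector; project it and every doc onto the topic axes.
query = np.zeros(9)
query[[4, 7]] = 1.0                                    # hypothetical query using terms 5 and 8
q_low = topics.T @ query
for d in range(M.shape[1]):
    d_low = topics.T @ M[:, d]
    print(f"doc {d+1}: topic-space similarity {cosine(q_low, d_low):+.2f}")
```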

600.465 - Intro to NLP - J. Eisner 8

Latent Semantic Analysis
Topics extracted for IR might help sense disambiguation.
Each word is like a tiny document: (0,0,0,1,0,0,…). Express the word as a linear combination of topics. Each topic corresponds to a sense? E.g., "Jordan" has Mideast and Sports topics (plus an Advertising topic, alas, which is the same sense as Sports).
A word's sense in a document: which of its topics are strongest in the document?
This groups senses as well as splitting them: one word has several topics, and many words have the same topic.

600.465 - Intro to NLP - J. Eisner 9

Latent Semantic Analysis
A perspective on Principal Components Analysis (PCA): imagine an electrical circuit that connects terms to docs …
[Figure: a bipartite network connecting terms 1-9 to documents 1-7. The matrix of strengths says how strong each term is in each document; each connection has a weight given by the matrix.]

600.465 - Intro to NLP - J. Eisner 10

Latent Semantic Analysis
Which documents is term 5 strong in? Docs 2, 5, and 6 light up strongest.
[Figure: the term-document network with term 5 activated]

600.465 - Intro to NLP - J. Eisner 11

Latent Semantic Analysis
Which documents are terms 5 and 8 strong in? This answers a query consisting of terms 5 and 8!
It's really just matrix multiplication: term vector (query) x strength matrix = doc vector. (A small numeric sketch follows below.)
[Figure: the term-document network with terms 5 and 8 activated]
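The matrix multiplication named on this slide, on a made-up 9x7 strength matrix: an indicator vector over terms times the matrix gives a score for every document.

```python
import numpy as np

# Hypothetical 9x7 strength matrix (rows = terms 1..9, columns = docs 1..7); values made up.
rng = np.random.default_rng(1)
strength = rng.random((9, 7)).round(2)

# "Which documents are terms 5 and 8 strong in?" is one matrix multiplication:
query = np.zeros(9)
query[[4, 7]] = 1.0                  # indicator vector for terms 5 and 8 (0-indexed 4 and 7)
doc_scores = query @ strength        # term vector (query) x strength matrix = doc vector
print(doc_scores)                    # the documents that "light up" have the largest scores
print("strongest doc:", int(doc_scores.argmax()) + 1)
```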

600.465 - Intro to NLP - J. Eisner 12

Latent Semantic Analysis
Conversely, what terms are strong in document 5? This gives doc 5's coordinates!
[Figure: the term-document network with document 5 activated]

600.465 - Intro to NLP - J. Eisner 13

Latent Semantic Analysis
SVD approximates this by a smaller 3-layer network. It forces the sparse data through a bottleneck, smoothing it.
[Figure: the full term-document network next to a 3-layer network in which terms connect to documents only through a small layer of topics]

600.465 - Intro to NLP - J. Eisner 14

Latent Semantic Analysis
I.e., smooth the sparse data by a matrix approximation: M ≈ A B. A encodes the camera angle; B gives each doc's new coordinates.
[Figure: the matrix M (terms x documents) factored as A (terms x topics) times B (topics x documents)]

600.465 - Intro to NLP - J. Eisner 15

Latent Semantic Analysis
Completely symmetric! Regard A and B as projecting terms and docs into a low-dimensional "topic space" where their similarity can be judged.
[Figure: the same M ≈ A B factorization, viewed as projecting both terms and documents into topic space]

600.465 - Intro to NLP - J. Eisner 16

Latent Semantic Analysis
Completely symmetric. Regard A and B as projecting terms and docs into a low-dimensional "topic space" where their similarity can be judged. Then we can:
Cluster documents (helps the sparsity problem!)
Cluster words
Compare a word with a doc
Identify a word's topics with its senses: sense disambiguation by looking at the document's senses
Identify a document's topics: topic categorization

600.465 - Intro to NLP - J. Eisner 17

If you’ve seen SVD before …

SVD actually decomposes M = A D B' exactly. A is the camera angle (orthonormal); D is diagonal; B' is orthonormal.
[Figure: the matrix M (terms x documents) written as the product A D B']

600.465 - Intro to NLP - J. Eisner 18

If you’ve seen SVD before …

Keep only the largest j < k diagonal elements of D. This gives the best possible approximation to M using only j blue units (topic dimensions); a numpy sketch follows below.
[Figure: M ≈ A D B' with only the top j diagonal entries of D kept]
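A numpy check of the statements on this and the previous slide, on a made-up matrix: the SVD reconstructs M exactly, A and B' are orthonormal, and keeping only the top j diagonal entries of D gives a rank-j approximation. (That this truncation is the best possible rank-j approximation is the Eckart-Young theorem; the code only measures the error, it doesn't prove optimality.)

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)    # toy terms x docs matrix again

A, d, Bt = np.linalg.svd(M, full_matrices=False)   # exact decomposition M = A @ diag(d) @ Bt
print(np.allclose(M, A @ np.diag(d) @ Bt))         # True: exact reconstruction
print(np.allclose(A.T @ A, np.eye(A.shape[1])))    # A has orthonormal columns
print(np.allclose(Bt @ Bt.T, np.eye(Bt.shape[0]))) # so does B' (the rows of Bt)

# Keep only the j largest diagonal elements of D (numpy sorts them in decreasing order):
j = 3
M_j = A[:, :j] @ np.diag(d[:j]) @ Bt[:j, :]
print("rank-%d approximation error:" % j, np.linalg.norm(M - M_j))
```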


600.465 - Intro to NLP - J. Eisner 20

If you’ve seen SVD before …

To simplify the picture, we can write M ≈ A (D B') = A B.
[Figure: the factorization M ≈ A B, where B = D B']
How should you pick j (the number of blue units)? Just like picking the number of clusters: how well does the system work with each j (on held-out data)?
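One possible held-out criterion, invented for illustration (the slide only poses the question): split each held-out document's words into two random halves, smooth the first half with the top-j training topics, and see how well that predicts the second half.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(2.0, size=(9, 40)).astype(float)    # toy terms x docs matrix, made up
train, held_out = M[:, :30], M[:, 30:]              # hold out some documents
U, _, _ = np.linalg.svd(train, full_matrices=False)

half_a = rng.binomial(held_out.astype(int), 0.5).astype(float)  # a random half of each count
half_b = held_out - half_a                                      # the other half

for j in range(1, 8):
    topics = U[:, :j]                               # topic axes fit on training docs only
    smoothed = topics @ (topics.T @ half_a)         # smooth the observed half with j topics
    err = np.linalg.norm(half_b - smoothed)
    print(f"j={j}: held-out prediction error {err:.2f}")
```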

21

Directed graphical models (Bayes nets)

slide thanks to Zoubin Ghahramani (modified)

[Figure: an example Bayes net over variables A, B, C, D, E]
Under any model: p(A, B, C, D, E) = p(A) p(B|A) p(C|A,B) p(D|A,B,C) p(E|A,B,C,D).
The model above (the graph in the figure) says something stronger: the joint factors into one term per variable, conditioned only on that variable's parents in the graph.

22

Unigram model for generating text

[Figure: nodes w1, w2, w3, … with no arrows between them]
The joint probability of the text is p(w1) p(w2) p(w3) …
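A minimal generator for this model; the toy vocabulary and the probability vector theta are made up for illustration.

```python
import numpy as np

# A minimal unigram generator: theta is a made-up probability vector over a toy vocabulary.
rng = np.random.default_rng(0)
vocab = ["the", "dog", "cat", "jordan", "basketball", "river"]
theta = np.array([0.4, 0.15, 0.15, 0.1, 0.1, 0.1])   # "which unigrams are likely"

words = rng.choice(vocab, size=12, p=theta)          # w1, w2, w3, ... drawn i.i.d. from theta
print(" ".join(words))

# The model's probability of this word sequence is the product of p(w_i | theta):
logprob = sum(np.log(theta[vocab.index(w)]) for w in words)
print("log p(w | theta) =", round(logprob, 2))
```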

23

Explicitly show model’s parameters

[Figure: a parameter node θ with arrows to w1, w2, w3, …]
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
"θ is a vector that says which unigrams are likely"

24

“Plate notation” simplifies diagram

p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Figure: plate notation: θ with an arrow into a plate containing w, repeated N1 times]
"θ is a vector that says which unigrams are likely"

25

Learn θ from the observed words (rather than vice-versa)
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Figure: the same plate diagram, with the words w observed and θ to be inferred]

26

Explicitly show the prior over θ (e.g., Dirichlet)
[Figure: plate diagram with a hyperparameter α (given) above θ, and θ above the plate of words w]
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
θ ~ Dirichlet(α); each wi is drawn given θ.
"Even if we didn't observe word 5, the prior says that θ5 = 0 is a terrible guess"

27

Dirichlet Distribution
Each point on a k-dimensional simplex is a multinomial probability distribution: θi ≥ 0 for all i, and Σi θi = 1.
[Figure: the simplex of distributions over a 3-word vocabulary (dog, the, cat), with one example point, the distribution (0.3, 0.5, 0.2)]

slide thanks to Nigel Crook

28

Dirichlet Distribution

A Dirichlet Distribution is a distribution over multinomial distributions in the simplex.
[Figure: two views of the 3-simplex with axes θ1, θ2, θ3, showing a density over the points of the simplex]

slide thanks to Nigel Crook

29

slide thanks to Percy Liang and Dan Klein

30

Dirichlet Distribution
Example draws from a Dirichlet Distribution over the 3-simplex (a numpy sketch of such draws follows below):
[Figure: three panels of sampled points on the simplex (axes θ1, θ2, θ3), for Dirichlet(5,5,5), Dirichlet(0.2, 5, 0.2), and Dirichlet(0.5, 0.5, 0.5)]

slide thanks to Nigel Crook
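The same three Dirichlet distributions, sampled with numpy; each draw is a point on the simplex, i.e. a multinomial distribution that sums to 1. Large symmetric parameters concentrate the draws near the center; small ones push them toward the corners.

```python
import numpy as np

# Draws from the three Dirichlet distributions named on the slide.
rng = np.random.default_rng(0)
for alpha in [(5, 5, 5), (0.2, 5, 0.2), (0.5, 0.5, 0.5)]:
    draws = rng.dirichlet(alpha, size=3)
    print("Dirichlet", alpha)
    for theta in draws:
        print("   ", np.round(theta, 2), "sums to", theta.sum().round(2))
```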

31

Explicitly show the prior over θ (e.g., Dirichlet)
[Figure: the same plate diagram: α (given) above θ, and θ above the plate of N words w]
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
"Even if we didn't observe word 5, the prior says that θ5 = 0 is a terrible guess"
The posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α):
prior = Dirichlet(α), posterior = Dirichlet(α + counts(w)).
The mean of the posterior is like the max-likelihood estimate of θ, but it smooths the corpus counts by adding "pseudocounts" α.

(But better to use whole posterior, not just the mean.)
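The update from this slide in a few lines of numpy; the 3-word vocabulary, the prior alpha, and the counts are made up.

```python
import numpy as np

# Posterior update: prior Dirichlet(alpha), observe word counts, posterior Dirichlet(alpha + counts).
alpha = np.array([1.0, 1.0, 1.0])             # symmetric prior over a 3-word vocabulary
counts = np.array([7, 2, 0])                  # observed corpus counts; word 3 was never seen

posterior = alpha + counts                    # parameters of the posterior Dirichlet
mle = counts / counts.sum()                   # max-likelihood estimate of theta
posterior_mean = posterior / posterior.sum()  # counts smoothed by pseudocounts alpha

print("MLE:           ", np.round(mle, 3))            # assigns probability 0 to the unseen word
print("posterior mean:", np.round(posterior_mean, 3)) # "theta_3 = 0 is a terrible guess"
```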

32

Training and Test Documents

[Figure: two plates sharing one θ: a training plate with the N1 words of document 1 and a test plate with the N2 words of document 2]
"Learn θ from document 1, use it to predict document 2"
What do good configurations look like if N1 is large? What if N1 is small?

33

Many Documents

[Figure: three plates, one per document, containing N1 words w1, N2 words w2, and N3 words w3; each plate has its own θ, all drawn from a shared α]
"Each document has its own unigram model"
Now does observing docs 1 and 3 still help predict doc 2?
Only if α learns that all the θ's are similar (low variance). And in that case, why even have separate θ's?

34

Many Documents

[Figure: plate notation: an outer plate over the D documents, each with its own θd, and an inner plate over that document's N words]
"Each document has its own unigram model"
θd ~ Dirichlet(α); each word wdi is drawn given θd.
α is given, or tuned to maximize training or dev set likelihood.
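A generative sketch of this plate diagram: each document draws its own theta_d from Dirichlet(alpha) and then draws its words. The vocabulary, alpha, and document lengths are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["the", "dog", "cat", "jordan", "basketball", "river"])
alpha = np.full(len(vocab), 0.5)             # hyperparameter alpha (given, or tuned on dev data)

for d, n_words in enumerate([8, 8, 8], start=1):
    theta_d = rng.dirichlet(alpha)           # this document's own unigram model
    words = rng.choice(vocab, size=n_words, p=theta_d)
    print(f"doc {d}:", " ".join(words))
```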

35

Bayesian Text Categorization

[Figure: plate notation: a plate of K topic parameter vectors θk, an outer plate over the D documents, and an inner plate over each document's N words]
"Each document chooses one of only K topics (unigram models)"
θk ~ Dirichlet(α); each word wdi is drawn given some θk, but which k?

36

Bayesian Text Categorization

[Figure: plate notation: each document d now has a topic variable zd with an arrow to its words w; the K topic parameter vectors θk sit in their own plate]
"Each document chooses one of only K topics (unigram models)"
This allows documents to differ considerably while some still share parameters. And we can infer the probability that two documents have the same topic z. We might observe some topics.
θk ~ Dirichlet(α); each word wdi is drawn given θk with k = zd, where zd is a topic in 1…K.
zd itself is drawn from a distribution over topics 1…K, and that distribution has a Dirichlet prior.

37

Latent Dirichlet Allocation (Blei, Ng & Jordan 2003)
[Figure: the LDA plate diagram: for each of the D documents, a topic mixture; inside, for each of its N words, a topic variable z and the word w; the K topic-word distributions sit in their own plate]
"Each document chooses a mixture of all K topics; each word gets its own topic"
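A generative sketch of the process described in the quote, using the usual LDA symbols (theta for the per-document topic mixture, beta for the topic-word distributions); the vocabulary, K, and hyperparameters are made up and are not the slides' own example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["game", "score", "team", "bank", "river", "water"])
K, alpha, eta = 2, 0.5, 0.5

beta = rng.dirichlet(np.full(len(vocab), eta), size=K)   # K topic-word distributions

for d in range(1, 4):
    theta_d = rng.dirichlet(np.full(K, alpha))           # this doc's mixture over all K topics
    words = []
    for _ in range(8):
        z = rng.choice(K, p=theta_d)                     # each word gets its own topic
        words.append(rng.choice(vocab, p=beta[z]))
    print(f"doc {d} (topic mix {np.round(theta_d, 2)}):", " ".join(words))
```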

38

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

39

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

40

Latent Dirichlet Allocation: Inference?

[Figure: the LDA model unrolled over one document: each word w1, w2, w3 has its own topic variable z1, z2, z3, with the K topics and D documents shown as plates]

41

Finite-State Dirichlet Allocation (Cui & Eisner 2006)
[Figure: like the unrolled LDA diagram, but with arrows z1 → z2 → z3 → … between consecutive topic variables; K topics, D documents]
"A different HMM for each document"

42

Variants of Latent Dirichlet Allocation

Syntactic topic model: A word or its topic is influenced by its syntactic position.

Correlated topic model, hierarchical topic model, …: Some topics resemble other topics.

Polylingual topic model: All versions of the same document use the same topic mixture, even if they're in different languages. (Why useful?)

Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?)

Dynamic topic model: We also observe a year for each document. The k topics used in 2011 have evolved slightly from their counterparts in 2010.

43

Dynamic Topic Model

slide thanks to Dave Blei

44

Dynamic Topic Model

slide thanks to Dave Blei

45

Dynamic Topic Model

slide thanks to Dave Blei

46

Dynamic Topic Model

slide thanks to Dave Blei

47

Remember: Finite-State Dirichlet Allocation (Cui & Eisner 2006)
[Figure: the same unrolled diagram: a chain z1 → z2 → z3 → … emitting words w1, w2, w3, …; K topics, D documents]
"A different HMM for each document"

48

Bayesian HMM

[Figure: the same unrolled chain z1 → z2 → z3 → … emitting w1, w2, w3, …, with K states, but now one model shared across all D documents]
"Shared HMM for all documents" (or just have 1 document)
We have to estimate the transition parameters and the emission parameters.
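A generative sketch of a shared HMM over topics/states: one transition matrix and one emission distribution per state, used by every document. The vocabulary, the transition matrix, and the emission distributions are made up, and the slide's own symbols for these parameters are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["game", "score", "bank", "river"])
K = 2
transition = np.array([[0.8, 0.2],               # p(z_t | z_{t-1})
                       [0.3, 0.7]])
emission = np.array([[0.45, 0.45, 0.05, 0.05],   # p(w | z) for state 0
                     [0.05, 0.05, 0.45, 0.45]])  # p(w | z) for state 1
start = np.array([0.5, 0.5])

z = rng.choice(K, p=start)
words, states = [], []
for _ in range(10):
    words.append(rng.choice(vocab, p=emission[z]))
    states.append(int(z))
    z = rng.choice(K, p=transition[z])
print("states:", states)
print("words: ", " ".join(words))
```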