Topic Modeling (600.465 Intro to NLP, J. Eisner): cs.jhu.edu/~jason/465/PDFSlides/lect31-topicmodels.pdf


600.465 - Intro to NLP - J. Eisner 1

Topic Modeling

600.465 - Intro to NLP - J. Eisner 2

Embedding Documents
A trick from Information Retrieval: each document in the corpus is a length-k vector (or each paragraph, or whatever).

(0, 3, 3, 1, 0, 7, . . . 1, 0)

That vector represents a single document. Pretty sparse, so pretty noisy! Hard to tell which docs are similar.

Wish we could smooth the vector: what would the document look like if it rambled on forever?
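A minimal sketch (not from the slides) of what such a length-k vector might look like in code, assuming the entries are word counts; the toy vocabulary, the example document, and the helper `count_vector` are all made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary of k words (in a real IR system k is large).
vocab = ["cat", "dog", "jordan", "basketball", "river", "bank", "the", "a"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def count_vector(document_tokens, word_to_index):
    """Return the length-k vector of word counts for one document."""
    v = np.zeros(len(word_to_index), dtype=int)
    for token in document_tokens:
        if token in word_to_index:      # ignore out-of-vocabulary words
            v[word_to_index[token]] += 1
    return v

doc = "the dog chased the cat across the river".split()
print(count_vector(doc, word_to_index))   # sparse: mostly zeros
```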

600.465 - Intro to NLP - J. Eisner 3

Latent Semantic Analysis
A trick from Information Retrieval: each document in the corpus is a length-k vector. Plot all documents in the corpus.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot. A clustering algorithm might discover this cluster of "similar" documents, similar thanks to smoothing!]

600.465 - Intro to NLP - J. Eisner 4

Latent Semantic Analysis
The reduced plot is a perspective drawing of the true plot: it projects the true plot onto a few axes, a best choice of axes that shows the most variation in the data. Found by linear algebra: "Principal Components Analysis" (PCA).

[Figure: true plot in k dimensions vs. reduced-dimensionality plot]

600.465 - Intro to NLP - J. Eisner 5

Latent Semantic Analysis
The SVD plot allows the best possible reconstruction of the true plot (i.e., we can recover the 3-D coordinates with minimal distortion).
It ignores variation in the axes that it didn't pick. Hope that variation is just noise and we want to ignore it.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]

600.465 - Intro to NLP - J. Eisner 6

Latent Semantic Analysis
SVD finds a small number of topic vectors. It approximates each doc as a linear combination of topics; the coordinates in the reduced plot are the linear coefficients. How much of topic A is in this document? How much of topic B? (A numpy sketch follows below.)
Each topic is a collection of words that tend to appear together.

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]
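A hedged numpy sketch of the idea on this slide: run SVD on a made-up 9x7 term-document matrix, keep j topic vectors, and read off each document's coefficients. The matrix, the seed, and the variable names are all invented for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical 9-term x 7-document count matrix (rows = terms, columns = docs),
# loosely matching the 9x7 picture in the slides; the numbers are made up.
rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)

# Full SVD: M = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only j "topic" dimensions.
j = 2
topics = U[:, :j]                          # each column: a topic = weighted collection of terms
doc_coords = np.diag(s[:j]) @ Vt[:j, :]    # each column: a doc's coordinates (how much of topic A, topic B)

# Each document is approximated as a linear combination of the j topic vectors.
M_approx = topics @ doc_coords
print("reconstruction error:", np.linalg.norm(M - M_approx))
```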

600.465 - Intro to NLP - J. Eisner 7

Latent Semantic Analysis
The new coordinates might actually be useful for Info Retrieval. To compare 2 documents, or a query and a document: project both into the reduced space and ask whether they have topics in common, even if they have no words in common! (A sketch of this comparison appears below.)

[Figure: true plot in k dimensions vs. reduced-dimensionality plot, with the reduced axes labeled topic A and topic B]
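A sketch of the comparison described above, reusing the same made-up 9x7 matrix and seed as before: project a query and every document into the 2-dimensional topic space and compare them there. The query's term indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)        # same toy 9x7 term-document matrix as above
U, s, Vt = np.linalg.svd(M, full_matrices=False)
topics = U[:, :2]                                      # 2 topic axes

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# A query is just another term vector; project it and every doc onto the topic axes.
query = np.zeros(9)
query[[4, 7]] = 1.0                                    # hypothetical query using terms 5 and 8
q_low = topics.T @ query
for d in range(M.shape[1]):
    d_low = topics.T @ M[:, d]
    print(f"doc {d+1}: topic-space similarity {cosine(q_low, d_low):+.2f}")
```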

600.465 - Intro to NLP - J. Eisner 8

Latent Semantic Analysis
Topics extracted for IR might help sense disambiguation.
Each word is like a tiny document: (0,0,0,1,0,0,…). Express the word as a linear combination of topics. Each topic corresponds to a sense? E.g., "Jordan" has Mideast and Sports topics (plus an Advertising topic, alas, which is the same sense as Sports).
A word's sense in a document: which of its topics are strongest in the document?
This groups senses as well as splitting them: one word has several topics, and many words have the same topic.

600.465 - Intro to NLP - J. Eisner 9

Latent Semantic Analysis
A perspective on Principal Components Analysis (PCA): imagine an electrical circuit that connects terms to docs …
[Figure: a bipartite network connecting terms 1-9 to documents 1-7. The matrix of strengths says how strong each term is in each document; each connection has a weight given by the matrix.]

600.465 - Intro to NLP - J. Eisner 10

Latent Semantic Analysis
Which documents is term 5 strong in? Docs 2, 5, and 6 light up strongest.
[Figure: the term-document network with term 5 activated]

600.465 - Intro to NLP - J. Eisner 11

Latent Semantic Analysis
Which documents are terms 5 and 8 strong in? This answers a query consisting of terms 5 and 8!
It's really just matrix multiplication: term vector (query) x strength matrix = doc vector. (A small numeric sketch follows below.)
[Figure: the term-document network with terms 5 and 8 activated]
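The matrix multiplication named on this slide, on a made-up 9x7 strength matrix: an indicator vector over terms times the matrix gives a score for every document.

```python
import numpy as np

# Hypothetical 9x7 strength matrix (rows = terms 1..9, columns = docs 1..7); values made up.
rng = np.random.default_rng(1)
strength = rng.random((9, 7)).round(2)

# "Which documents are terms 5 and 8 strong in?" is one matrix multiplication:
query = np.zeros(9)
query[[4, 7]] = 1.0                  # indicator vector for terms 5 and 8 (0-indexed 4 and 7)
doc_scores = query @ strength        # term vector (query) x strength matrix = doc vector
print(doc_scores)                    # the documents that "light up" have the largest scores
print("strongest doc:", int(doc_scores.argmax()) + 1)
```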

600.465 - Intro to NLP - J. Eisner 12

Latent Semantic Analysis
Conversely, what terms are strong in document 5? This gives doc 5's coordinates!
[Figure: the term-document network with document 5 activated]

600.465 - Intro to NLP - J. Eisner 13

Latent Semantic Analysis
SVD approximates this by a smaller 3-layer network. It forces the sparse data through a bottleneck, smoothing it.
[Figure: the full term-document network next to a 3-layer network in which terms connect to documents only through a small layer of topics]

600.465 - Intro to NLP - J. Eisner 14

Latent Semantic Analysis
I.e., smooth the sparse data by a matrix approximation: M ≈ A B. A encodes the camera angle; B gives each doc's new coordinates.
[Figure: the matrix M (terms x documents) factored as A (terms x topics) times B (topics x documents)]

600.465 - Intro to NLP - J. Eisner 15

Latent Semantic Analysis
Completely symmetric! Regard A and B as projecting terms and docs into a low-dimensional "topic space" where their similarity can be judged.
[Figure: the same M ≈ A B factorization, viewed as projecting both terms and documents into topic space]

600.465 - Intro to NLP - J. Eisner 16

Latent Semantic Analysis
Completely symmetric. Regard A and B as projecting terms and docs into a low-dimensional "topic space" where their similarity can be judged. Then we can:
Cluster documents (helps the sparsity problem!)
Cluster words
Compare a word with a doc
Identify a word's topics with its senses: sense disambiguation by looking at the document's senses
Identify a document's topics: topic categorization

600.465 - Intro to NLP - J. Eisner 17

If you’ve seen SVD before …

SVD actually decomposes M = A D B' exactly. A is the camera angle (orthonormal); D is diagonal; B' is orthonormal.
[Figure: the matrix M (terms x documents) written as the product A D B']

600.465 - Intro to NLP - J. Eisner 18

If you’ve seen SVD before …

Keep only the largest j < k diagonal elements of D. This gives the best possible approximation to M using only j blue units (topic dimensions); a numpy sketch follows below.
[Figure: M ≈ A D B' with only the top j diagonal entries of D kept]
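A numpy check of the statements on this and the previous slide, on a made-up matrix: the SVD reconstructs M exactly, A and B' are orthonormal, and keeping only the top j diagonal entries of D gives a rank-j approximation. (That this truncation is the best possible rank-j approximation is the Eckart-Young theorem; the code only measures the error, it doesn't prove optimality.)

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)    # toy terms x docs matrix again

A, d, Bt = np.linalg.svd(M, full_matrices=False)   # exact decomposition M = A @ diag(d) @ Bt
print(np.allclose(M, A @ np.diag(d) @ Bt))         # True: exact reconstruction
print(np.allclose(A.T @ A, np.eye(A.shape[1])))    # A has orthonormal columns
print(np.allclose(Bt @ Bt.T, np.eye(Bt.shape[0]))) # so does B' (the rows of Bt)

# Keep only the j largest diagonal elements of D (numpy sorts them in decreasing order):
j = 3
M_j = A[:, :j] @ np.diag(d[:j]) @ Bt[:j, :]
print("rank-%d approximation error:" % j, np.linalg.norm(M - M_j))
```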


600.465 - Intro to NLP - J. Eisner 20

If you’ve seen SVD before …

To simplify the picture, we can write M ≈ A (D B') = A B.
[Figure: the factorization M ≈ A B, where B = D B']
How should you pick j (the number of blue units)? Just like picking the number of clusters: how well does the system work with each j (on held-out data)?
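One possible held-out criterion, invented for illustration (the slide only poses the question): split each held-out document's words into two random halves, smooth the first half with the top-j training topics, and see how well that predicts the second half.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(2.0, size=(9, 40)).astype(float)    # toy terms x docs matrix, made up
train, held_out = M[:, :30], M[:, 30:]              # hold out some documents
U, _, _ = np.linalg.svd(train, full_matrices=False)

half_a = rng.binomial(held_out.astype(int), 0.5).astype(float)  # a random half of each count
half_b = held_out - half_a                                      # the other half

for j in range(1, 8):
    topics = U[:, :j]                               # topic axes fit on training docs only
    smoothed = topics @ (topics.T @ half_a)         # smooth the observed half with j topics
    err = np.linalg.norm(half_b - smoothed)
    print(f"j={j}: held-out prediction error {err:.2f}")
```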

21

Directed graphical models (Bayes nets)

slide thanks to Zoubin Ghahramani (modified)

[Figure: an example Bayes net over variables A, B, C, D, E]
Under any model: p(A, B, C, D, E) = p(A) p(B|A) p(C|A,B) p(D|A,B,C) p(E|A,B,C,D).
The model above (the graph in the figure) says something stronger: the joint factors into one term per variable, conditioned only on that variable's parents in the graph.

22

Unigram model for generating text

[Figure: nodes w1, w2, w3, … with no arrows between them]
The joint probability of the text is p(w1) p(w2) p(w3) …
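A minimal generator for this model; the toy vocabulary and the probability vector theta are made up for illustration.

```python
import numpy as np

# A minimal unigram generator: theta is a made-up probability vector over a toy vocabulary.
rng = np.random.default_rng(0)
vocab = ["the", "dog", "cat", "jordan", "basketball", "river"]
theta = np.array([0.4, 0.15, 0.15, 0.1, 0.1, 0.1])   # "which unigrams are likely"

words = rng.choice(vocab, size=12, p=theta)          # w1, w2, w3, ... drawn i.i.d. from theta
print(" ".join(words))

# The model's probability of this word sequence is the product of p(w_i | theta):
logprob = sum(np.log(theta[vocab.index(w)]) for w in words)
print("log p(w | theta) =", round(logprob, 2))
```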

23

Explicitly show model’s parameters

[Figure: a parameter node θ with arrows to w1, w2, w3, …]
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
"θ is a vector that says which unigrams are likely"

24

“Plate notation” simplifies diagram

p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Figure: plate notation: θ with an arrow into a plate containing w, repeated N1 times]
"θ is a vector that says which unigrams are likely"

25

Learn θ from the observed words (rather than vice-versa)
p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
[Figure: the same plate diagram, with the words w observed and θ to be inferred]

26

Explicitly show the prior over θ (e.g., Dirichlet)
[Figure: plate diagram with a hyperparameter α (given) above θ, and θ above the plate of words w]
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
θ ~ Dirichlet(α); each wi is drawn given θ.
"Even if we didn't observe word 5, the prior says that θ5 = 0 is a terrible guess"

27

Dirichlet Distribution
Each point on a k-dimensional simplex is a multinomial probability distribution: θi ≥ 0 for all i, and Σi θi = 1.
[Figure: the simplex of distributions over a 3-word vocabulary (dog, the, cat), with one example point, the distribution (0.3, 0.5, 0.2)]

slide thanks to Nigel Crook

28

Dirichlet Distribution

A Dirichlet Distribution is a distribution over multinomial distributions in the simplex.
[Figure: two views of the 3-simplex with axes θ1, θ2, θ3, showing a density over the points of the simplex]

slide thanks to Nigel Crook

29

slide thanks to Percy Liang and Dan Klein

30

Dirichlet Distribution
Example draws from a Dirichlet Distribution over the 3-simplex (a numpy sketch of such draws follows below):
[Figure: three panels of sampled points on the simplex (axes θ1, θ2, θ3), for Dirichlet(5,5,5), Dirichlet(0.2, 5, 0.2), and Dirichlet(0.5, 0.5, 0.5)]

slide thanks to Nigel Crook
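The same three Dirichlet distributions, sampled with numpy; each draw is a point on the simplex, i.e. a multinomial distribution that sums to 1. Large symmetric parameters concentrate the draws near the center; small ones push them toward the corners.

```python
import numpy as np

# Draws from the three Dirichlet distributions named on the slide.
rng = np.random.default_rng(0)
for alpha in [(5, 5, 5), (0.2, 5, 0.2), (0.5, 0.5, 0.5)]:
    draws = rng.dirichlet(alpha, size=3)
    print("Dirichlet", alpha)
    for theta in draws:
        print("   ", np.round(theta, 2), "sums to", theta.sum().round(2))
```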

31

Explicitly show the prior over θ (e.g., Dirichlet)
[Figure: the same plate diagram: α (given) above θ, and θ above the plate of N words w]
p(α) p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
"Even if we didn't observe word 5, the prior says that θ5 = 0 is a terrible guess"
The posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α):
prior = Dirichlet(α), posterior = Dirichlet(α + counts(w)).
The mean of the posterior is like the max-likelihood estimate of θ, but it smooths the corpus counts by adding "pseudocounts" α.

(But better to use whole posterior, not just the mean.)
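The update from this slide in a few lines of numpy; the 3-word vocabulary, the prior alpha, and the counts are made up.

```python
import numpy as np

# Posterior update: prior Dirichlet(alpha), observe word counts, posterior Dirichlet(alpha + counts).
alpha = np.array([1.0, 1.0, 1.0])             # symmetric prior over a 3-word vocabulary
counts = np.array([7, 2, 0])                  # observed corpus counts; word 3 was never seen

posterior = alpha + counts                    # parameters of the posterior Dirichlet
mle = counts / counts.sum()                   # max-likelihood estimate of theta
posterior_mean = posterior / posterior.sum()  # counts smoothed by pseudocounts alpha

print("MLE:           ", np.round(mle, 3))            # assigns probability 0 to the unseen word
print("posterior mean:", np.round(posterior_mean, 3)) # "theta_3 = 0 is a terrible guess"
```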

32

Training and Test Documents

[Figure: two plates sharing one θ: a training plate with the N1 words of document 1 and a test plate with the N2 words of document 2]
"Learn θ from document 1, use it to predict document 2"
What do good configurations look like if N1 is large? What if N1 is small?

33

Many Documents

[Figure: three plates, one per document, containing N1 words w1, N2 words w2, and N3 words w3; each plate has its own θ, all drawn from a shared α]
"Each document has its own unigram model"
Now does observing docs 1 and 3 still help predict doc 2?
Only if α learns that all the θ's are similar (low variance). And in that case, why even have separate θ's?

34

Many Documents

[Figure: plate notation: an outer plate over the D documents, each with its own θd, and an inner plate over that document's N words]
"Each document has its own unigram model"
θd ~ Dirichlet(α); each word wdi is drawn given θd.
α is given, or tuned to maximize training or dev set likelihood.
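A generative sketch of this plate diagram: each document draws its own theta_d from Dirichlet(alpha) and then draws its words. The vocabulary, alpha, and document lengths are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["the", "dog", "cat", "jordan", "basketball", "river"])
alpha = np.full(len(vocab), 0.5)             # hyperparameter alpha (given, or tuned on dev data)

for d, n_words in enumerate([8, 8, 8], start=1):
    theta_d = rng.dirichlet(alpha)           # this document's own unigram model
    words = rng.choice(vocab, size=n_words, p=theta_d)
    print(f"doc {d}:", " ".join(words))
```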

35

Bayesian Text Categorization

[Figure: plate notation: a plate of K topic parameter vectors θk, an outer plate over the D documents, and an inner plate over each document's N words]
"Each document chooses one of only K topics (unigram models)"
θk ~ Dirichlet(α); each word wdi is drawn given some θk, but which k?

36

Bayesian Text Categorization

[Figure: plate notation: each document d now has a topic variable zd with an arrow to its words w; the K topic parameter vectors θk sit in their own plate]
"Each document chooses one of only K topics (unigram models)"
This allows documents to differ considerably while some still share parameters. And we can infer the probability that two documents have the same topic z. We might observe some topics.
θk ~ Dirichlet(α); each word wdi is drawn given θk with k = zd, where zd is a topic in 1…K.
zd itself is drawn from a distribution over topics 1…K, and that distribution has a Dirichlet prior.

37

Latent Dirichlet Allocation (Blei, Ng & Jordan 2003)
[Figure: the LDA plate diagram: for each of the D documents, a topic mixture; inside, for each of its N words, a topic variable z and the word w; the K topic-word distributions sit in their own plate]
"Each document chooses a mixture of all K topics; each word gets its own topic"
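A generative sketch of the process described in the quote, using the usual LDA symbols (theta for the per-document topic mixture, beta for the topic-word distributions); the vocabulary, K, and hyperparameters are made up and are not the slides' own example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["game", "score", "team", "bank", "river", "water"])
K, alpha, eta = 2, 0.5, 0.5

beta = rng.dirichlet(np.full(len(vocab), eta), size=K)   # K topic-word distributions

for d in range(1, 4):
    theta_d = rng.dirichlet(np.full(K, alpha))           # this doc's mixture over all K topics
    words = []
    for _ in range(8):
        z = rng.choice(K, p=theta_d)                     # each word gets its own topic
        words.append(rng.choice(vocab, p=beta[z]))
    print(f"doc {d} (topic mix {np.round(theta_d, 2)}):", " ".join(words))
```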

38

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

39

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

40

Latent Dirichlet Allocation: Inference?

[Figure: the LDA model unrolled over one document: each word w1, w2, w3 has its own topic variable z1, z2, z3, with the K topics and D documents shown as plates]

41

Finite-State Dirichlet Allocation (Cui & Eisner 2006)
[Figure: like the unrolled LDA diagram, but with arrows z1 → z2 → z3 → … between consecutive topic variables; K topics, D documents]
"A different HMM for each document"

42

Variants of Latent Dirichlet Allocation

Syntactic topic model: A word or its topic is influenced by its syntactic position.

Correlated topic model, hierarchical topic model, …: Some topics resemble other topics.

Polylingual topic model: All versions of the same document use the same topic mixture, even if they're in different languages. (Why useful?)

Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?)

Dynamic topic model: We also observe a year for each document. The k topics used in 2011 have evolved slightly from their counterparts in 2010.

43

Dynamic Topic Model

slide thanks to Dave Blei

44

Dynamic Topic Model

slide thanks to Dave Blei

45

Dynamic Topic Model

slide thanks to Dave Blei

46

Dynamic Topic Model

slide thanks to Dave Blei

47

Remember: Finite-State Dirichlet Allocation (Cui & Eisner 2006)
[Figure: the same unrolled diagram: a chain z1 → z2 → z3 → … emitting words w1, w2, w3, …; K topics, D documents]
"A different HMM for each document"

48

Bayesian HMM

[Figure: the same unrolled chain z1 → z2 → z3 → … emitting w1, w2, w3, …, with K states, but now one model shared across all D documents]
"Shared HMM for all documents" (or just have 1 document)
We have to estimate the transition parameters and the emission parameters.
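A generative sketch of a shared HMM over topics/states: one transition matrix and one emission distribution per state, used by every document. The vocabulary, the transition matrix, and the emission distributions are made up, and the slide's own symbols for these parameters are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["game", "score", "bank", "river"])
K = 2
transition = np.array([[0.8, 0.2],               # p(z_t | z_{t-1})
                       [0.3, 0.7]])
emission = np.array([[0.45, 0.45, 0.05, 0.05],   # p(w | z) for state 0
                     [0.05, 0.05, 0.45, 0.45]])  # p(w | z) for state 1
start = np.array([0.5, 0.5])

z = rng.choice(K, p=start)
words, states = [], []
for _ in range(10):
    words.append(rng.choice(vocab, p=emission[z]))
    states.append(int(z))
    z = rng.choice(K, p=transition[z])
print("states:", states)
print("words: ", " ".join(words))
```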