Nicolas Loeff: Latent Dirichlet Allocation

Transcript (39 slides)

Page 1:

Latent Dirichlet Allocation

D. Blei, A. Ng, M. Jordan

Includes some slides adapted from J. Ramos at Rutgers, M. Steyvers and M. Rosen-Zvi at UCI, and L. Fei-Fei at UIUC.

Page 2:

Overview

What is so special about text?
Classification methods
LSI
Unigram / Mixture of Unigrams
Probabilistic LSI (Aspect Model)
LDA model
Geometric interpretation

Page 3:

What is so special about text?

No obvious relation between features.

High dimensionality (often the vocabulary size V is larger than the number of documents!).

Importance of speed.

Page 4:

The need for dimensionality reduction

Representation: documents as vectors in word space, the ‘bag of words’ representation.

It is a sparse representation (V >> |D|).

A need to define conceptual closeness.

Page 5:

Bag of words

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.

China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, trade, value

Page 6:

Bag of words

Order of words in a document can be ignored; only counts matter.

Probability theory: exchangeability (includes IID) (Aldous, 1985).

Exchangeable RVs have a representation as a mixture distribution (de Finetti, 1990).
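In symbols, this is the representation that motivates topic models: if the words w_1, ..., w_N of a document are exchangeable, then their joint distribution can be written as a mixture over a latent random parameter θ,

p(w_1, \ldots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta

so a bag-of-words assumption implicitly commits us to a mixture model.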

Page 7:

What does this have to do with Vision?

[Figure: an object image and its representation as a bag of visual ‘words’]

Page 8:

TF-IDF Weighting Scheme (Salton and McGill, 1983)

Given corpus D, word w, and document d, calculate

w_{w,d} = f_{w,d} \cdot \log\left( |D| / f_{w,D} \right)

where f_{w,d} is the frequency of w in d and f_{w,D} is the number of documents containing w. Many varieties of the basic scheme exist.

Search procedure: scan each d, compute each w_{i,d}, return the set D' that maximizes \sum_i w_{i,d}.
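A minimal sketch of this weighting in Python, assuming the corpus is given as a list of tokenized documents (the variable names and example corpus are illustrative, not from the slides):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute w_{w,d} = f_{w,d} * log(|D| / f_{w,D}) for every word in every document."""
    n_docs = len(corpus)
    # f_{w,D}: number of documents containing word w at least once
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        term_freq = Counter(doc)  # f_{w,d}: raw counts within this document
        weights.append({w: tf * math.log(n_docs / doc_freq[w])
                        for w, tf in term_freq.items()})
    return weights

corpus = [["china", "trade", "surplus", "yuan"],
          ["brain", "visual", "retina", "cortex"],
          ["china", "exports", "yuan", "bank"]]
print(tf_idf(corpus)[0])
```

Words that appear in every document get weight zero; rare, frequent-within-document words dominate.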

Page 9:

A Spatial Representation: Latent Semantic Analysis (Deerwester, 1990)

[Figure: a word/document count matrix (rows such as LOVE, SOUL, RESEARCH, SCIENCE; columns Doc1, Doc2, Doc3, ...) factorized with the SVD]

High dimensional space, not as high as |V|

[Figure: the words SOUL, RESEARCH, LOVE, SCIENCE plotted as points in the reduced semantic space]

• Each word is a single point in semantic space (dimensionality reduction)
• Similarity measured by the cosine of the angle between word vectors
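A small sketch of the LSA pipeline with NumPy; the count matrix values here are made up for illustration:

```python
import numpy as np

# Toy word-document count matrix: rows = words, columns = documents.
words = ["love", "soul", "research", "science"]
counts = np.array([[4.0, 3.0, 0.0],
                   [2.0, 1.0, 0.0],
                   [0.0, 1.0, 6.0],
                   [0.0, 0.0, 5.0]])

# Truncated SVD: keep only the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
word_vecs = U[:, :k] * s[:k]  # each word becomes a point in the k-dim semantic space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words from the same "topic" should end up close in angle.
print(cosine(word_vecs[0], word_vecs[1]))  # love vs soul
print(cosine(word_vecs[0], word_vecs[3]))  # love vs science
```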

Page 10:

Feature Vector representation

From: Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Pierre Baldi, Paolo Frasconi, Padhraic Smyth

Page 11:

Classification: assigning words to topics

Different models for the data:

Discriminative classifier: models the boundaries between the different classes of the data; predicts a categorical output (e.g., SVM).

Density estimator: models the distribution of the data points themselves; a generative model (e.g., naive Bayes).

Page 12:

Generative Models – Latent semantic structure

Latent structure (ℓ) → Words (w)

Distribution over words:

P(w) = \sum_{\ell} P(w, \ell)

Inferring latent structure:

P(\ell \mid w) = \frac{P(w \mid \ell)\, P(\ell)}{P(w)}

Page 13:

Topic Models

Unsupervised learning of the topics (“gist”) of documents: articles/chapters, conversations, emails, ... any verbal context.

Topics are useful latent structures to explain semantic association.

Page 14:

Probabilistic Generative Model

Each document is a probability distribution over topics

Each topic is a probability distribution over words

Page 15:

Generative Process

TOPIC 1: a bag of words dominated by money, loan, bank.
TOPIC 2: a bag of words dominated by river, stream, bank.

DOCUMENT 1 (mixture weights ≈ .8 TOPIC 1, .2 TOPIC 2): money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2 (mixture weights ≈ .3 TOPIC 1, .7 TOPIC 2): river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

(The superscript on each token marks the topic that generated it.)

The topics are the mixture components; the per-document topic proportions are the mixture weights.

Bayesian approach: use priors. Mixture weights ~ Dirichlet( α ); mixture components ~ Dirichlet( β ).
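A compact sketch of this generative process in Python with NumPy; the vocabulary, hyperparameters, and document length below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "loan", "bank", "river", "stream"]
n_topics, n_words, doc_len = 2, len(vocab), 30
alpha, beta = 1.0, 1.0  # symmetric Dirichlet hyperparameters

# Mixture components: one word distribution per topic ~ Dirichlet(beta).
phi = rng.dirichlet(beta * np.ones(n_words), size=n_topics)

def generate_document():
    # Mixture weights: this document's topic proportions ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)   # choose a topic for this token
        w = rng.choice(n_words, p=phi[z])   # choose a word from that topic
        words.append(f"{vocab[w]}{z + 1}")  # superscript-style topic label
    return theta, words

theta, doc = generate_document()
print(np.round(theta, 2), doc[:8])
```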

Page 16:

Vision: Topic = Object categories

Page 17:

Simple Model: Unigram

The words of every document are drawn IID from a single multinomial distribution:
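The slide's equation, as in Blei et al. (2003):

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)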

Page 18:

Unigram Mixture Model

First choose a topic z, then generate the words conditionally independently given the topic:
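Adding the topic variable z gives the document likelihood (Blei et al., 2003):

p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)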


Page 21:

Probabilistic Latent Semantic Indexing (Hofmann, 1999)

The training-set document index d and the word wn are conditionally independent given the topic.

Not truly generative (d is a dummy r.v.). The number of parameters grows with the size of the corpus (overfitting).

A document may contain several topics.
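The pLSI likelihood of a (document, word) pair (Hofmann, 1999):

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)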

Page 22:

Vision app.: Sivic et al., 2005

[pLSI plate diagram: d → z → w, with N words per document and D documents; example discovered topic: "face"]

Page 23:

LDA
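These slides build up the LDA model of Blei et al. (2003): for each document, draw topic proportions θ ~ Dirichlet(α); for each of its N words, draw a topic z_n ~ Multinomial(θ) and then a word w_n ~ p(w_n | z_n, β). The joint distribution of a document is

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)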


Page 28:

Vision app.: Fei-Fei Li, 2005

[LDA-style plate diagram: scene class c → topic proportions π → topic z → visual word w, with N patches per image and D images; example scene category: "beach"]

Page 29:

Example: Word density distribution

Page 30:

A geometric interpretation

[Figure from Blei et al.: the topics are points on the word simplex and span a topic sub-simplex; LDA places a smooth, Dirichlet-induced distribution over this sub-simplex, while pLSI only places mass at the training documents' points.]

Page 31:

LDA

Topics are sampled repeatedly within each document (like pLSI), but the number of parameters does not grow with the size of the corpus.

Problem: inference.

Page 32:

LDA - Inference

Coupling between the Dirichlet distributions makes exact inference intractable.

Blei et al., 2001: variational approximation.
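Concretely, the quantity that cannot be computed exactly is the per-document marginal likelihood, where θ and β are coupled inside the integral (Blei et al., 2003):

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta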

Page 33:

LDA - Inference

Other procedures:
Markov chain Monte Carlo (Griffiths et al., 2002)
Expectation propagation (Minka et al., 2002)
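For concreteness, here is a minimal sketch of the collapsed Gibbs sampler popularized by Griffiths & Steyvers; this is their MCMC procedure, not the paper's variational method, and the corpus and hyperparameters below are illustrative:

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids."""
    rng = np.random.default_rng(seed)
    # Count tables: doc-topic, topic-word, and per-topic totals.
    ndk = np.zeros((len(docs), n_topics))
    nkw = np.zeros((n_topics, n_words))
    nk = np.zeros(n_topics)
    # Random initial topic assignment for every token.
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z_i = k | z_-i, w) up to a constant:
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny documents over a 4-word vocabulary.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]]
ndk, nkw = gibbs_lda(docs, n_topics=2, n_words=4)
print(ndk)
```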

Page 34:

Experiments

Perplexity: the inverse of the geometric mean per-word likelihood (a monotonically decreasing function of the likelihood).

Idea: lower perplexity implies better generalization.
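The definition used in Blei et al. (2003), for a test set of M documents where document d has N_d words:

\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}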

Page 35:

Experiments – Nematode corpus

Page 36:

Experiments – AP corpus

Page 37:

Page 38:

Polysemy

Six learned topics (top words). Polysemous words such as PLAY, COURT, CHARACTERS, TEST, and EVIDENCE appear under several topics with different senses:

Topic 1: PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS

Topic 2: PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED

Topic 3: TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT

Topic 4: JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL

Topic 5: HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION

Topic 6: STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Page 39:

Choosing the number of topics

Subjective interpretability.

Bayesian model selection: Griffiths & Steyvers (2004).

Generalization tests.

Non-parametric Bayesian statistics: infinite models, i.e., models that grow with the size of the data. Teh, Jordan, Beal, & Blei (2004); Blei, Griffiths, Jordan, & Tenenbaum (2004).