Latent variable models for discrete data
Jianfei Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084
January 13, 2014
Murphy, Kevin P. Machine learning: a probabilistic perspective. The MIT Press, 2012. Chapter 27.
Introduction
We want to model three types of discrete data
Sequence of tokens: p(y_{i,1:L_i})
Bag of words: p(n_i)
Discrete features: p(y_{i,1:R})
Jianfei Chen (THU) Latent variable models January 13, 2014
Outline
Mixture Models
LSA / PLSI / LDA / GaP / NMF
LDA
Evaluation
Inference
Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.
RBM
Mixture models
p(y) = ∑_k p(y|q_i = k) p(q_i = k)
Sequence of tokens: p(y_{i,1:L_i}|q_i = k) = ∏_{l=1}^{L_i} Cat(y_{il}|b_k)
Discrete features: p(y_{i,1:R}|q_i = k) = ∏_{r=1}^{R} Cat(y_{ir}|b_k^{(r)})
Bag of words (known L_i): p(n_i|L_i, q_i = k) = Mu(n_i|L_i, b_k)
Bag of words (unknown L_i): p(n_i|q_i = k) = ∏_{v=1}^{V} Poi(n_{iv}|λ_{vk})
Mixture models
Theorem
If ∀i, X_i ∼ Poi(λ_i) and n = ∑_i X_i, then
p(X_1, · · · , X_k|n) = Mu(X|n, π)
where π_i = λ_i / ∑_k λ_k.
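The theorem can be checked numerically: a minimal sketch (the `poisson_pmf` and `multinomial_pmf` helpers are illustrative, not from the slides) comparing the Poisson probabilities conditioned on the sum with the multinomial pmf for one concrete count vector.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poi(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def multinomial_pmf(xs, n, ps):
    """P(X = xs) for X ~ Mu(n, ps)."""
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p**x
    return prob

# Independent Poissons with rates lam; condition on their sum n.
lam = [1.0, 2.0, 3.0]
xs = [2, 1, 4]
n = sum(xs)

# p(X = xs | sum = n) = prod_i Poi(x_i|lam_i) / Poi(n|sum(lam))
joint = 1.0
for x, l in zip(xs, lam):
    joint *= poisson_pmf(x, l)
cond = joint / poisson_pmf(n, sum(lam))

# Theorem: this equals Mu(xs|n, pi) with pi_i = lam_i / sum(lam)
pi = [l / sum(lam) for l in lam]
assert abs(cond - multinomial_pmf(xs, n, pi)) < 1e-12
```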
Exponential Family PCA
latent semantic analysis (LSA) / latent semantic indexing (LSI)
Sequence of tokens: p(y_{i,1:L_i}|z_i) = ∏_{l=1}^{L_i} Cat(y_{il}|S(Wz_i))
Discrete features: p(y_{i,1:R}|z_i) = ∏_{r=1}^{R} Cat(y_{ir}|S(W_r z_i))
Bag of words (known L_i): p(n_i|L_i, z_i) = Mu(n_i|L_i, S(Wz_i))
Bag of words (unknown L_i): p(n_i|z_i) = ∏_{v=1}^{V} Poi(n_{iv}|exp(w_{v,:} z_i))
where S(·) is the softmax transformation, z_i ∈ R^K, W, W_r ∈ R^{V×K}.
Inference
coordinate ascent / degenerate EM (problem: overfitting?)
variational EM / MCMC
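The softmax link S(Wz_i) can be made concrete in a few lines of NumPy; this is a minimal sketch, with the toy sizes V = 5, K = 2 and the `softmax` helper chosen only for illustration.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: S(a)_v = exp(a_v) / sum_u exp(a_u)."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
V, K = 5, 2                    # vocabulary size, latent dimension (toy values)
W = rng.normal(size=(V, K))
z = rng.normal(size=K)

theta = softmax(W @ z)         # Cat parameters for one document's tokens
assert theta.shape == (V,)
assert abs(theta.sum() - 1.0) < 1e-12
assert (theta > 0).all()
```

Whatever the latent vector z is, S(Wz) always lands on the probability simplex, which is what makes it a valid Cat parameter.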
LSA / PLSI / LDA
Unigram: p(y_{i,1:L_i}|q_i = k) = ∏_{l=1}^{L_i} Cat(y_{il}|b_k)
LSI: p(y_{i,1:L_i}|z_i) = ∏_{l=1}^{L_i} Cat(y_{il}|S(Wz_i))
PLSI: p(y_{i,1:L_i}|π_i) = ∏_{l=1}^{L_i} Cat(y_{il}|Bπ_i)
LDA: p(y_{i,1:L_i}|π_i) = ∏_{l=1}^{L_i} Cat(y_{il}|Bπ_i), π_i ∼ Dir(π_i|α)
LDA for other data types
Bag of words: p(n_i|L_i, π_i) = Mu(n_i|L_i, Bπ_i)
Discrete features: p(y_{i,1:R}|π_i) = ∏_{r=1}^{R} Cat(y_{ir}|B^{(r)}π_i)
Question: what is the dual parameter? Why is it convenient?
Marlin, Benjamin M. "Modeling user rating profiles for collaborative filtering." Advances in neural information processing systems. 2003.
Gamma-Poisson Model
LDA
models p(n_i|L_i, π_i) = Mu(n_i|L_i, Bπ_i)
Prior π_i ∼ Dir(α)
Constraints: 0 ≤ π_{ik}, ∑_k π_{ik} = 1; 0 ≤ B_{vk}, ∑_v B_{vk} = 1
GaP
models p(n_i|z_i^+) = ∏_{v=1}^{V} Poi(n_{iv}|b_{v,:}^T z_i^+)
Prior p(z_i^+) = ∏_k Ga(z_{ik}^+|α_k, β_k)
Constraints: 0 ≤ z_{ik}, 0 ≤ B_{vk}
Can use a sparsity-inducing prior (27.17)
GaP only has non-negativity constraints
Non-negative matrix factorization
Given a non-negative matrix V, find non-negative matrix factors W, H such that
V ≈ WH
V_{i,:} ≈ ∑_k W_{ik} H_{k,:}
Can be viewed as GaP with prior α_k = β_k = 0.
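A minimal sketch of the multiplicative updates from the cited Lee & Seung paper, for the Frobenius objective ‖V − WH‖²; the iteration count, random initialization, and the small `eps` guard against division by zero are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def nmf(V, K, iters=500, eps=1e-9, seed=0):
    """Lee & Seung multiplicative updates for ||V - WH||^2.
    Elementwise multiply/divide keeps W and H non-negative throughout."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Recover an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(V, K=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
assert rel_err < 0.1
assert (W >= 0).all() and (H >= 0).all()
```

Each update multiplies by a non-negative ratio, so non-negativity is preserved without any projection step; this is the design choice that distinguishes these updates from projected gradient descent.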
Seung, D., and L. Lee. "Algorithms for non-negative matrix factorization." Advances in neural information processing systems.
Latent Dirichlet Allocation (LDA)
Notation
π_i|α ∼ Dir(α) (1)
q_{il}|π_i ∼ Cat(π_i) (2)
b_k|γ ∼ Dir(γ) (3)
y_{il}|q_{il} = k, B ∼ Cat(b_k) (4)
Geometric interpretation
Simplex: handle ambiguity (?)
Unidentifiable: Labeled LDA
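The generative process (1)–(4) can be sketched directly as ancestral sampling; the function name and toy corpus sizes below are assumptions for illustration only.

```python
import numpy as np

def sample_lda_corpus(D, K, V, L, alpha, gamma, seed=0):
    """Ancestral sampling from LDA:
    b_k ~ Dir(gamma), pi_i ~ Dir(alpha), q_il ~ Cat(pi_i), y_il ~ Cat(b_{q_il})."""
    rng = np.random.default_rng(seed)
    B = rng.dirichlet(np.full(V, gamma), size=K)   # K topic-word distributions
    docs = []
    for _ in range(D):
        pi = rng.dirichlet(np.full(K, alpha))      # per-document topic proportions
        q = rng.choice(K, size=L, p=pi)            # topic assignment per token
        y = [int(rng.choice(V, p=B[k])) for k in q]  # token drawn from its topic
        docs.append(y)
    return docs

docs = sample_lda_corpus(D=3, K=2, V=10, L=5, alpha=0.5, gamma=0.1)
assert len(docs) == 3 and all(len(d) == 5 for d in docs)
```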
D. Blei et al. "Latent Dirichlet allocation." JMLR 2003.
G. Heinrich. "Parameter estimation for text analysis."
D. Ramage et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." EMNLP 2009.
http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
Evaluation: Perplexity
Perplexity of language model q given language p is defined as (both p and q are stochastic processes)
perplexity(p, q) = 2^{H(p,q)}
where H(p, q) is the cross-entropy
H(p, q) = lim_{N→∞} −(1/N) ∑_{y_{1:N}} p(y_{1:N}) log q(y_{1:N})
Approximations
N is finite
p(y_{1:N}) = δ_{y*_{1:N}}(y_{1:N})
Evaluation: Perplexity
H(p, q) = −(1/N) log q(y*_{1:N})
Intuition: weighted average branching factor
For a unigram model
H = −(1/N) ∑_{i=1}^{N} (1/L_i) ∑_{l=1}^{L_i} log q(y*_{il})
For LDA
H = −(1/N) ∑_{i=1}^{N} log p(y*_{i,1:L_i})
Use variational evidence lower bound (ELBO)
Use annealed importance sampling
Use validation set and plug in approximation
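A minimal sketch of the unigram-model case above: fit q on training tokens and compute 2^H on test tokens. The add-one smoothing is an assumed modeling choice (an unsmoothed q would give infinite perplexity on unseen words), not something the slides prescribe.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, V, smooth=1.0):
    """perplexity = 2^H with H = -(1/N) sum_l log2 q(y_l), where q is an
    add-one-smoothed unigram model estimated from the training tokens."""
    counts = Counter(train_tokens)
    total = len(train_tokens) + smooth * V
    H = 0.0
    for y in test_tokens:
        q = (counts[y] + smooth) / total
        H -= math.log2(q)
    H /= len(test_tokens)
    return 2 ** H

train = ["a", "b", "a", "c", "a"]
test = ["a", "b", "c"]
pp = unigram_perplexity(train, test, V=3)
# q(a) = 4/8, q(b) = q(c) = 2/8, so H = 5/3 and pp = 2^(5/3)
assert abs(pp - 2 ** (5 / 3)) < 1e-9
```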
H. Wallach, et al. ”Evaluation methods for topic models.” ICML 2009
Evaluation: Coherence
TODO
D. Newman et al. ”Automatic evaluation of topic coherence.” NAACL HLT 2010.
Inference
Exponential number of inference algorithms
Variational inference vs sampling vs both
Collapsed vs non-collapsed
Online vs stochastic vs offline
Empirical Bayes vs fully Bayes
Other algorithms: expectation propagation, etc.
Inference: towards large scale
algorithms
Online / stochastic
Sparsity
Spectral methods
system
Distributed: Yahoo-LDA, Petuum, Parameter-Server, etc.
GPU: BIDMach, etc.
Model Selection
Compute evidence with AIS / ELBO
Cross validation
Bayesian non-parametrics
Teh et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association (2006).
Extensions of LDA
Correlation: Correlated topic model
Time series: Dynamic topic model
Syntax: LDA-HMM
Supervision: many
1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)
nD label: MR-LDA, random effects mixture of experts, conditional topic random field, Dirichlet multinomial regression LDA
K labels per document: Labeled LDA
Labels per word: TagLDA
Structural: RTM
Restricted Boltzmann machines
p(h, v|θ) = (1/Z(θ)) ∏_{r=1}^{R} ∏_{k=1}^{K} ψ_{rk}(v_r, h_k)
where h, v are binary vectors.
Factorized posterior
p(h|v, θ) = ∏_k p(h_k|v, θ)
Advantage: symmetric; both posterior inference (backward) and generation (forward) are easy.
Exponential family harmonium (a harmonium is a 2-layer UGM)
Restricted Boltzmann machines
Binary latent and binary visible units (other models exist, see Table 27.2)
p(v, h|θ) = (1/Z(θ)) exp(−E(v, h; θ)) (5)
E(v, h; θ) = −v^T W h (6)
p(h|v, θ) = ∏_k Ber(h_k|sigm(w_{:,k}^T v)) (7)
p(v|h, θ) = ∏_r Ber(v_r|sigm(w_{r,:}^T h)) (8)
Restricted Boltzmann machines
Goal: maximize p(v|θ)
∇_W ℓ = E_{p_emp}[vh^T] − E_{p(·|θ)}[vh^T]
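The model expectation in this gradient is intractable in general. A common approximation (not stated on the slide) is contrastive divergence: replace it with a single Gibbs step started at the data. A minimal CD-1 sketch, with bias terms omitted and all names illustrative:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, rng):
    """CD-1 estimate of grad_W = E_emp[v h^T] - E_model[v h^T],
    using one Gibbs step v0 -> h0 -> v1 -> h1 in place of E_model."""
    # positive phase: hidden units given the data v0
    ph0 = sigm(v0 @ W)                        # p(h_k = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample h0
    # negative phase: reconstruct v, then h, from the sample
    pv1 = sigm(h0 @ W.T)                      # p(v_r = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigm(v1 @ W)
    return np.outer(v0, ph0) - np.outer(v1, ph1)

rng = np.random.default_rng(0)
R, K = 6, 3                                   # visible / hidden sizes (toy values)
W = rng.normal(scale=0.1, size=(R, K))
v = (rng.random(R) < 0.5) * 1.0               # one binary data vector
g = cd1_gradient(v, W, rng)
assert g.shape == (R, K)
```

Using the probabilities ph0, ph1 rather than binary samples in the outer products is a standard variance-reduction choice.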
Conclusions
Why there are so many things to do
Exponential number of inference algorithms
Exponential number of models
Exponential × exponential number of solutions
Application, evaluation, theory (e.g. spectral), etc.
Need a way for information retrieval and data mining practitioners to find correct & fast solutions for their problems...