Latent variable models for discrete data
Jianfei Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084
January 13, 2014
Murphy, Kevin P. Machine learning: a probabilistic perspective. The MIT Press, 2012. Chapter 27.
Introduction
We want to model three types of discrete data
Sequence of tokens: p(y_{i,1:L_i})
Bag of words: p(n_i)
Discrete features: p(y_{i,1:R})
Jianfei Chen (THU) Latent variable models January 13, 2014
Outline
Mixture Models
LSA / PLSI / LDA / GaP / NMF
LDA
Evaluation
Inference
Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.
RBM
Mixture models
p(y) = ∑_k p(y|q_i = k) p(q_i = k)
Sequence of tokens: p(y_{i,1:L_i}|q_i = k) = ∏_{l=1}^{L_i} Cat(y_{il}|b_k)
Discrete features: p(y_{i,1:R}|q_i = k) = ∏_{r=1}^{R} Cat(y_{ir}|b_k^{(r)})
Bag of words (known L_i): p(n_i|L_i, q_i = k) = Mu(n_i|L_i, b_k)
Bag of words (unknown L_i): p(n_i|q_i = k) = ∏_{v=1}^{V} Poi(n_{iv}|λ_{vk})
Mixture models
Theorem
If ∀i, X_i ∼ Poi(λ_i) and n = ∑_i X_i, then
p(X_1, · · · , X_k|n) = Mu(X|n, π)
where π_i = λ_i / ∑_k λ_k.
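The theorem can be checked numerically: a minimal sketch (the `poisson_pmf` and `multinomial_pmf` helpers are illustrative, not from the slides) comparing the Poisson probabilities conditioned on the sum with the multinomial pmf for one concrete count vector.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poi(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def multinomial_pmf(xs, n, ps):
    """P(X = xs) for X ~ Mu(n, ps)."""
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p**x
    return prob

# Independent Poissons with rates lam; condition on their sum n.
lam = [1.0, 2.0, 3.0]
xs = [2, 1, 4]
n = sum(xs)

# p(X = xs | sum = n) = prod_i Poi(x_i|lam_i) / Poi(n|sum(lam))
joint = 1.0
for x, l in zip(xs, lam):
    joint *= poisson_pmf(x, l)
cond = joint / poisson_pmf(n, sum(lam))

# Theorem: this equals Mu(xs|n, pi) with pi_i = lam_i / sum(lam)
pi = [l / sum(lam) for l in lam]
assert abs(cond - multinomial_pmf(xs, n, pi)) < 1e-12
```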
Exponential Family PCA
latent semantic analysis (LSA) / latent semantic indexing (LSI)
Sequence of tokens: p(y_{i,1:L_i}|z_i) = ∏_{l=1}^{L_i} Cat(y_{il}|S(Wz_i))
Discrete features: p(y_{i,1:R}|z_i) = ∏_{r=1}^{R} Cat(y_{ir}|S(W_r z_i))
Bag of words (known L_i): p(n_i|L_i, z_i) = Mu(n_i|L_i, S(Wz_i))
Bag of words (unknown L_i): p(n_i|z_i) = ∏_{v=1}^{V} Poi(n_{iv}|exp(w_{v,:} z_i))
where S(·) is the softmax transformation, z_i ∈ R^K, W, W_r ∈ R^{V×K}.
Inference
coordinate ascent / degenerate EM (problem: overfitting?)
variational EM / MCMC
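The softmax link S(Wz_i) can be made concrete in a few lines of NumPy; this is a minimal sketch, with the toy sizes V = 5, K = 2 and the `softmax` helper chosen only for illustration.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: S(a)_v = exp(a_v) / sum_u exp(a_u)."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
V, K = 5, 2                    # vocabulary size, latent dimension (toy values)
W = rng.normal(size=(V, K))
z = rng.normal(size=K)

theta = softmax(W @ z)         # Cat parameters for one document's tokens
assert theta.shape == (V,)
assert abs(theta.sum() - 1.0) < 1e-12
assert (theta > 0).all()
```

Whatever the latent vector z is, S(Wz) always lands on the probability simplex, which is what makes it a valid Cat parameter.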
LSA / PLSI / LDA
Unigram: p(y_{i,1:L_i}|q_i = k) = ∏_{l=1}^{L_i} Cat(y_{il}|b_k)
LSI: p(y_{i,1:L_i}|z_i) = ∏_{l=1}^{L_i} Cat(y_{il}|S(Wz_i))
PLSI: p(y_{i,1:L_i}|π_i) = ∏_{l=1}^{L_i} Cat(y_{il}|Bπ_i)
LDA: p(y_{i,1:L_i}|π_i) = ∏_{l=1}^{L_i} Cat(y_{il}|Bπ_i), π_i ∼ Dir(π_i|α)
LDA for other data types
Bag of words: p(n_i|L_i, π_i) = Mu(n_i|L_i, Bπ_i)
Discrete features: p(y_{i,1:R}|π_i) = ∏_{r=1}^{R} Cat(y_{ir}|B^{(r)}π_i)
Question: what is the dual parameter? Why is it convenient?
Marlin, Benjamin M. "Modeling user rating profiles for collaborative filtering." Advances in neural information processing systems. 2003.
Gamma-Poisson Model
LDA
models p(n_i|L_i, π_i) = Mu(n_i|L_i, Bπ_i)
Prior π_i ∼ Dir(α)
Constraints: 0 ≤ π_{ik}, ∑_k π_{ik} = 1; 0 ≤ B_{vk}, ∑_v B_{vk} = 1
GaP
models p(n_i|z_i^+) = ∏_{v=1}^{V} Poi(n_{iv}|b_{v,:}^T z_i^+)
Prior p(z_i^+) = ∏_k Ga(z_{ik}^+|α_k, β_k)
Constraints: 0 ≤ z_{ik}, 0 ≤ B_{vk}
Can use a sparsity-inducing prior (27.17)
GaP only has non-negativity constraints
Non-negative matrix factorization
Given a non-negative matrix V, find non-negative matrix factors W, H such that
V ≈ WH
V_{i,:} ≈ ∑_k W_{ik} H_{k,:}
Can be viewed as GaP with prior α_k = β_k = 0.
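A minimal sketch of the multiplicative updates from the cited Lee & Seung paper, for the Frobenius objective ‖V − WH‖²; the iteration count, random initialization, and the small `eps` guard against division by zero are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def nmf(V, K, iters=500, eps=1e-9, seed=0):
    """Lee & Seung multiplicative updates for ||V - WH||^2.
    Elementwise multiply/divide keeps W and H non-negative throughout."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Recover an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(V, K=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
assert rel_err < 0.1
assert (W >= 0).all() and (H >= 0).all()
```

Each update multiplies by a non-negative ratio, so non-negativity is preserved without any projection step; this is the design choice that distinguishes these updates from projected gradient descent.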
Seung, D., and L. Lee. "Algorithms for non-negative matrix factorization." Advances in neural information processing systems.
Latent Dirichlet Allocation (LDA)
Notation
π_i|α ∼ Dir(α) (1)
q_{il}|π_i ∼ Cat(π_i) (2)
b_k|γ ∼ Dir(γ) (3)
y_{il}|q_{il} = k, B ∼ Cat(b_k) (4)
Geometric interpretation
Simplex: handle ambiguity (?)
Unidentifiable: Labeled LDA
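The generative process (1)–(4) can be sketched directly as ancestral sampling; the function name and toy corpus sizes below are assumptions for illustration only.

```python
import numpy as np

def sample_lda_corpus(D, K, V, L, alpha, gamma, seed=0):
    """Ancestral sampling from LDA:
    b_k ~ Dir(gamma), pi_i ~ Dir(alpha), q_il ~ Cat(pi_i), y_il ~ Cat(b_{q_il})."""
    rng = np.random.default_rng(seed)
    B = rng.dirichlet(np.full(V, gamma), size=K)   # K topic-word distributions
    docs = []
    for _ in range(D):
        pi = rng.dirichlet(np.full(K, alpha))      # per-document topic proportions
        q = rng.choice(K, size=L, p=pi)            # topic assignment per token
        y = [int(rng.choice(V, p=B[k])) for k in q]  # token drawn from its topic
        docs.append(y)
    return docs

docs = sample_lda_corpus(D=3, K=2, V=10, L=5, alpha=0.5, gamma=0.1)
assert len(docs) == 3 and all(len(d) == 5 for d in docs)
```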
D. Blei et al. "Latent Dirichlet allocation." JMLR 2003.
G. Heinrich. "Parameter estimation for text analysis."
D. Ramage et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." EMNLP 2009.
http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
Evaluation: Perplexity
Perplexity of language model q given language p is defined as (both p and q are stochastic processes)
perplexity(p, q) = 2^{H(p,q)}
where H(p, q) is the cross-entropy
H(p, q) = lim_{N→∞} −(1/N) ∑_{y_{1:N}} p(y_{1:N}) log q(y_{1:N})
Approximations
N is finite
p(y_{1:N}) = δ_{y*_{1:N}}(y_{1:N})
Evaluation: Perplexity
H(p, q) = −(1/N) log q(y*_{1:N})
Intuition: weighted average branching factor
For a unigram model
H = −(1/N) ∑_{i=1}^{N} (1/L_i) ∑_{l=1}^{L_i} log q(y*_{il})
For LDA
H = −(1/N) ∑_{i=1}^{N} log p(y*_{i,1:L_i})
Use variational evidence lower bound (ELBO)
Use annealed importance sampling
Use validation set and plug in approximation
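A minimal sketch of the unigram-model case above: fit q on training tokens and compute 2^H on test tokens. The add-one smoothing is an assumed modeling choice (an unsmoothed q would give infinite perplexity on unseen words), not something the slides prescribe.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, V, smooth=1.0):
    """perplexity = 2^H with H = -(1/N) sum_l log2 q(y_l), where q is an
    add-one-smoothed unigram model estimated from the training tokens."""
    counts = Counter(train_tokens)
    total = len(train_tokens) + smooth * V
    H = 0.0
    for y in test_tokens:
        q = (counts[y] + smooth) / total
        H -= math.log2(q)
    H /= len(test_tokens)
    return 2 ** H

train = ["a", "b", "a", "c", "a"]
test = ["a", "b", "c"]
pp = unigram_perplexity(train, test, V=3)
# q(a) = 4/8, q(b) = q(c) = 2/8, so H = 5/3 and pp = 2^(5/3)
assert abs(pp - 2 ** (5 / 3)) < 1e-9
```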
H. Wallach, et al. ”Evaluation methods for topic models.” ICML 2009
Evaluation: Coherence
TODO
D. Newman et al. ”Automatic evaluation of topic coherence.” NAACL HLT 2010.
Inference
Exponential number of inference algorithms
Variational inference vs sampling vs both
Collapsed vs non-collapsed
Online vs stochastic vs offline
Empirical Bayes vs fully Bayes
Other algorithms: expectation propagation, etc.
Inference: towards large scale
algorithms
Online / stochastic
Sparsity
Spectral methods
system
Distributed: Yahoo-LDA, Petuum, Parameter-Server, etc.
GPU: BIDMach, etc.
Model Selection
Compute evidence with AIS / ELBO
Cross validation
Bayesian non-parametrics
Teh et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association (2006).
Extensions of LDA
Correlation: Correlated topic model
Time series: Dynamic topic model
Syntax: LDA-HMM
Supervision: many
1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)
nD label: MR-LDA, random effects mixture of experts, conditional topic random field, Dirichlet multinomial regression LDA
K labels per document: Labeled LDA
Labels per word: TagLDA
Structural: RTM
Restricted Boltzmann machines
p(h, v|θ) = (1/Z(θ)) ∏_{r=1}^{R} ∏_{k=1}^{K} ψ_{rk}(v_r, h_k)
where h, v are binary vectors.
Factorized posterior
p(h|v, θ) = ∏_k p(h_k|v, θ)
Advantage: symmetric; both posterior inference (backward) and generation (forward) are easy.
Exponential family harmonium (a harmonium is a 2-layer UGM)
Restricted Boltzmann machines
Binary latent and binary visible units (other models exist, see Table 27.2)
p(v, h|θ) = (1/Z(θ)) exp(−E(v, h; θ)) (5)
E(v, h; θ) = −v^T W h (6)
p(h|v, θ) = ∏_k Ber(h_k|sigm(w_{:,k}^T v)) (7)
p(v|h, θ) = ∏_r Ber(v_r|sigm(w_{r,:}^T h)) (8)
Restricted Boltzmann machines
Goal: maximize p(v|θ)
∇_W ℓ = E_{p_emp}[vh^T] − E_{p(·|θ)}[vh^T]
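The model expectation in this gradient is intractable in general. A common approximation (not stated on the slide) is contrastive divergence: replace it with a single Gibbs step started at the data. A minimal CD-1 sketch, with bias terms omitted and all names illustrative:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, rng):
    """CD-1 estimate of grad_W = E_emp[v h^T] - E_model[v h^T],
    using one Gibbs step v0 -> h0 -> v1 -> h1 in place of E_model."""
    # positive phase: hidden units given the data v0
    ph0 = sigm(v0 @ W)                        # p(h_k = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample h0
    # negative phase: reconstruct v, then h, from the sample
    pv1 = sigm(h0 @ W.T)                      # p(v_r = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigm(v1 @ W)
    return np.outer(v0, ph0) - np.outer(v1, ph1)

rng = np.random.default_rng(0)
R, K = 6, 3                                   # visible / hidden sizes (toy values)
W = rng.normal(scale=0.1, size=(R, K))
v = (rng.random(R) < 0.5) * 1.0               # one binary data vector
g = cd1_gradient(v, W, rng)
assert g.shape == (R, K)
```

Using the probabilities ph0, ph1 rather than binary samples in the outer products is a standard variance-reduction choice.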
Conclusions
Why there are so many things to do
Exponential number of inference algorithms
Exponential number of models
Exponential × exponential number of solutions
Application, evaluation, theory (e.g. spectral), etc.
Need a way for information retrieval and data mining practitioners to find correct & fast solutions for their problems...