A Unifying Review of Linear Gaussian Models (Roweis 1999)


Transcript of A Unifying Review of Linear Gaussian Models (Roweis 1999)

Page 1

A Unifying Review of Linear Gaussian Models [1]

Sam Roweis, Zoubin Ghahramani

Feynman Liang, Application #: 10342444

November 11, 2014

[1] Roweis, Sam, and Zoubin Ghahramani. “A Unifying Review of Linear Gaussian Models.” Neural Computation 11.2 (1999): 305–45. Print.


Page 2

Motivation

Many superficially disparate models...

Figure: (a) Factor Analysis, (b) PCA, (c) Mixture of Gaussians, (d) Hidden Markov Models


Page 3

Outline

Basic model

Inference and learning problems

EM algorithm

Various specializations of the basic model

Factor Analysis

SPCA

PCA

Kalman Filter

Gaussian Mixture Model

1-NN

HMM

Figure: Taxonomy of the specializations. Continuous state: A = 0 with R diagonal (Factor Analysis), R = εI (SPCA), R = limε→0 εI (PCA); A ≠ 0 (Kalman filter). Discrete state: A = 0 (Gaussian Mixture Model; R = limε→0 εR0 gives vector quantization / 1-NN); A ≠ 0 (HMM).


Page 4

The Basic (Generative) Model

Goal: Model P({xt}τt=1, {yt}τt=1)

Assumptions:

Linear dynamics, additive Gaussian noise

xt+1 = Axt + w•, w• ∼ N (0,Q)

yt = Cxt + v•, v• ∼ N (0,R)

wlog E[w•] = E[v•] = 0

Markov property

Time homogeneity

Figure: The Basic Model as a DBN (xt → xt+1 through A with noise w•; xt → yt through C with noise v•)

P({xt}τt=1, {yt}τt=1) = P(x1) ∏_{t=1}^{τ−1} P(xt+1 | xt) ∏_{t=1}^{τ} P(yt | xt)
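
As a concrete illustration, here is a minimal numpy sketch of sampling from this generative model; the particular A, C, Q, R, µ1, Q1 and horizon τ below are hypothetical placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model parameters (k = 2 latent dims, p = 3 observed dims).
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # state transition
C = rng.standard_normal((3, 2))          # observation matrix
Q = 0.1 * np.eye(2)                      # state noise covariance
R = 0.2 * np.eye(3)                      # observation noise covariance
mu1, Q1 = np.zeros(2), np.eye(2)         # initial state distribution
tau = 50                                 # sequence length

x = rng.multivariate_normal(mu1, Q1)
xs, ys = [], []
for t in range(tau):
    ys.append(C @ x + rng.multivariate_normal(np.zeros(3), R))  # y_t = C x_t + v_t
    xs.append(x)
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)         # x_{t+1} = A x_t + w_t
```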


Page 5

Why Gaussians?

Gaussian family closed under affine transforms

x ∼ N (µx, Σx), y ∼ N (µy, Σy) independent, a, b, c ∈ R
=⇒ ax + by + c ∼ N (aµx + bµy + c, a²Σx + b²Σy)

Gaussian is conjugate prior for Gaussian likelihood

P(x) Normal, P(y | x) Normal =⇒ P(x | y) Normal
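
As a quick sanity check of the affine-closure formula (with hypothetical scalar parameters), one can compare sample moments against the stated mean and variance:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, var_x, mu_y, var_y = 1.0, 2.0, -0.5, 0.5   # hypothetical parameters
a, b, c = 3.0, -2.0, 4.0

x = rng.normal(mu_x, np.sqrt(var_x), size=1_000_000)
y = rng.normal(mu_y, np.sqrt(var_y), size=1_000_000)  # independent of x
z = a * x + b * y + c

print(z.mean(), a * mu_x + b * mu_y + c)           # means agree
print(z.var(), a**2 * var_x + b**2 * var_y)        # variances agree
```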


Page 6

The Inference Problem

Given the system model and initial distribution ({A,C ,Q,R, µ1,Q1}):

Filtering: P(xt |{yi}ti=1)

Smoothing: P(xt |{yi}τi=1) where τ ≥ t

If we had the partition function (normalizing constant)

P({yi}τi=1) = ∫ P({xi}τi=1, {yi}τi=1) d{xi}τi=1

then the posterior would follow from Bayes’ rule:

P({xi}τi=1 | {yi}τi=1) = P({xi}, {yi}) / P({yi}),

with the filtered and smoothed marginals P(xt | {yi}) obtained by integrating out the remaining states.


Page 7

The Learning Problem

Let θ = {A, C, Q, R, µ1, Q1}, X = {xi}τi=1, Y = {yi}τi=1. Given (several) observable sequences Y:

arg maxθ L(θ) = arg maxθ log P(Y | θ)

Solved by expectation maximization.


Page 8

Expectation Maximization

For any distribution Q on Sx :

L(θ) ≥ F(Q, θ) = ∫ Q(X) log P(X, Y | θ) dX − ∫ Q(X) log Q(X) dX

= L(θ) − H(Q, P(·|Y, θ)) + H(Q)

= L(θ) − DKL(Q || P(·|Y, θ))

Monotonically increasing coordinate ascent on F(Q, θ):

E step: Qk+1 ← arg maxQ F(Q, θk) = P(X |Y , θk)

M step: θk+1 ← arg maxθ F(Qk+1, θ)
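
A minimal skeleton of this coordinate ascent; e_step and m_step stand in for the model-specific computations (hypothetical callables, e.g. the Kalman-smoother or forward-backward E steps). Convergence checking on L(θ) is omitted for brevity.

```python
from typing import Any, Callable

def em(y: Any,
       theta0: Any,
       e_step: Callable[[Any, Any], Any],   # returns Q_{k+1} = P(X | Y, theta_k) (or its sufficient statistics)
       m_step: Callable[[Any, Any], Any],   # returns theta_{k+1} = argmax_theta F(Q_{k+1}, theta)
       n_iters: int = 100) -> Any:
    """Generic EM: alternate maximizing F(Q, theta) over Q and over theta."""
    theta = theta0
    for _ in range(n_iters):
        q = e_step(y, theta)    # E step: posterior over hidden states under current theta
        theta = m_step(y, q)    # M step: re-estimate parameters given that posterior
    return theta
```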


Page 9

Continuous-State Static Modeling

Assumptions:

x is continuously supported

A = 0

x• = w• ∼ N (0,Q) =⇒ y• = Cx• + v• ∼ N (0, CQCᵀ + R)

wlog Q = I

Efficient Inference Using Sufficient Statistics: Gaussian is conjugate prior for Gaussian likelihood, so

P(x•|y•) = N (βy•, I − βC), β = Cᵀ(CCᵀ + R)⁻¹
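
A minimal numpy sketch of this posterior computation, assuming Q = I as above; the C and R values are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
k, p = 2, 5                                   # hypothetical latent / observed dimensions
C = rng.standard_normal((p, k))               # loading matrix
R = np.diag(rng.uniform(0.1, 1.0, size=p))    # observation noise covariance
y = rng.standard_normal(p)                    # one observation

beta = C.T @ np.linalg.inv(C @ C.T + R)       # beta = C^T (C C^T + R)^{-1}
post_mean = beta @ y                          # E[x | y] = beta y
post_cov = np.eye(k) - beta @ C               # Cov[x | y] = I - beta C
```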

Learning: R must be constrained to avoid a degenerate solution...


Page 10

Continuous-State Static Modeling: Factor Analysis

y• = Cx• + v• ∼ N (0, CCᵀ + R)

Additional Assumption:

R diagonal =⇒ observation noise v• independent along basis for y

Interpretation:

R : independent noise variance along each observed coordinate

C : correlation structure of the observations induced by the latent factors

Properties:

Scale invariant

Not rotation invariant


Page 11

Continuous-State Static Modeling: SPCA and PCA

y• = Cx• + v• ∼ N (0, CCᵀ + R)

Additional Assumptions:

R = εI, ε > 0

For PCA: R = limε→0 εI

Interpretation:

ε : global noise level

Columns of C : principal components (optimizes three equivalent objectives)

Properties

Rotation invariant

Not scale invariant
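
A small numpy sketch (with a hypothetical C) illustrating the PCA limit: as ε → 0, β = Cᵀ(CCᵀ + εI)⁻¹ approaches the least-squares projection (CᵀC)⁻¹Cᵀ onto the span of C's columns.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 2
C = rng.standard_normal((p, k))               # hypothetical loading matrix
pinv = np.linalg.inv(C.T @ C) @ C.T           # least-squares projection (C^+)

for eps in (1.0, 1e-2, 1e-6):
    beta = C.T @ np.linalg.inv(C @ C.T + eps * np.eye(p))
    print(eps, np.max(np.abs(beta - pinv)))   # difference shrinks as eps -> 0
```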


Page 12

Continuous-State Dynamic Modeling: Kalman Filters

Relax the A = 0 assumption.

Optimal Bayes filter assuming linearity and normality (conjugate prior)
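
A minimal sketch of one Kalman predict/update cycle in the basic model's notation; this is a generic textbook form of the recursion, not code from the paper, and the inputs are assumed numpy arrays.

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One predict/update cycle: returns the filtered mean and covariance after seeing y."""
    # Predict: propagate the previous filtered belief through the dynamics.
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update: condition on the new observation y (Gaussian conjugacy).
    S = C @ P_pred @ C.T + R                   # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)        # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```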


Page 13

Discrete-State Modeling: Winner-Takes-All (WTA) Non-linearity

Assume: x discretely supported, ∫ ↦ ∑

Winner-Takes-All Non-Linearity: WTA[x] = ei where i = arg maxj xj

xt+1 = WTA[Axt + w•] w• ∼ N (µ,Q)

yt = Cxt + v• v• ∼ N (0,R)

x ∼ WTA[N (µ,Σ)] defines a probability vector π where πi = P(x = ei) = probability mass assigned by N (µ,Σ) to {z ∈ Sx : ∀j ≠ i, (z)i > (z)j}
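
A small numpy sketch of the WTA map and a Monte Carlo estimate of the induced π, with hypothetical µ and Σ.

```python
import numpy as np

def wta(z: np.ndarray) -> np.ndarray:
    """WTA[z] = e_i with i = argmax_j z_j (one-hot winner-takes-all)."""
    e = np.zeros_like(z)
    e[np.argmax(z)] = 1.0
    return e

rng = np.random.default_rng(4)
mu = np.array([0.5, 0.0, -0.5])               # hypothetical mean
Sigma = np.eye(3)                             # hypothetical covariance
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
pi_hat = np.mean([wta(z) for z in samples], axis=0)   # estimate of pi_i = P(x = e_i)
print(pi_hat)
```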


Page 14

Static Discrete-State Modeling: Mixture of Gaussians and Vector Quantization

x• = WTA[w•] w• ∼ N (µ,Q)

y• = Cx• + v• v• ∼ N (0,R)

Additional Assumption: A = 0

“Mixture of Gaussians”:

P(y•) = ∑i P(x• = ei, y•) = ∑i N (Ci, R) πi

All Gaussians have the same covariance R

Inference:

P(x• = ej | y•) = P(x• = ej, y•) / P(y•) = N (Cj, R) πj / ∑i N (Ci, R) πi
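
A minimal numpy sketch of this posterior (the mixture “responsibilities”); the component means (columns of C), shared covariance R, and priors π below are hypothetical.

```python
import numpy as np

def gauss_pdf(y, mean, cov):
    """Multivariate normal density N(y; mean, cov)."""
    d = len(y)
    diff = y - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

rng = np.random.default_rng(5)
C = np.column_stack([[0.0, 0.0], [3.0, 3.0]])   # columns C_i are component means (hypothetical)
R = np.eye(2)                                    # shared covariance
pi = np.array([0.4, 0.6])                        # mixing proportions pi_i
y = rng.standard_normal(2)                       # one observation

lik = np.array([gauss_pdf(y, C[:, i], R) for i in range(len(pi))])
resp = lik * pi / np.sum(lik * pi)               # P(x = e_j | y)
print(resp)
```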

Vector Quantization: R = limε→0 εR0

Page 15

Dynamic Discrete-State Modeling: Hidden Markov Models

xt+1 = WTA[Axt + w•] w• ∼ N (0,Q)

yt = Cxt + v• v• ∼ N (0,R)

Theorem

Any Markov chain transition dynamics T can be equivalently modeled using A and Q in the above model, and vice versa.

All states have the same emission covariance R

Learning: EM Algorithm (Baum-Welch)

Inference: Viterbi Algorithm for MAP estimate

In the discrete case, the MAP estimate ≠ the least-squares estimate

Approaches Kalman filtering as the discretization gets finer
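
A minimal log-space Viterbi sketch for the MAP state path; the transition matrix, initial distribution, and per-step emission log-likelihoods are generic hypothetical inputs.

```python
import numpy as np

def viterbi(log_pi, log_T, log_lik):
    """MAP state sequence.
    log_pi:  (n,)    initial log-probabilities
    log_T:   (n, n)  log transition probabilities, log_T[i, j] = log P(x_{t+1}=j | x_t=i)
    log_lik: (tau, n) emission log-likelihoods log P(y_t | x_t=j)
    """
    tau, n = log_lik.shape
    delta = log_pi + log_lik[0]             # best log-score of paths ending in each state
    back = np.zeros((tau, n), dtype=int)
    for t in range(1, tau):
        scores = delta[:, None] + log_T     # scores[i, j]: extend the best path ending at i with j
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_lik[t]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(tau - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```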


Page 16

Conclusions

Linearity and normality =⇒ computationally tractable

Universal basic model generalizes idiosyncratic special cases and highlights relationships (e.g. static vs dynamic, zero-noise limit, hyperparameter selection)

Unified set of equations and algorithms for inference and learning


Page 17

Critique / Future Work

Critique:

Unified algorithms not the most efficient

Can only model y with support Rp, x with support Rk or {1, . . . , n}

Future Work:

Increase hierarchy beyond two levels (e.g. Speech → n-gram → PCFG)

Relax time homogeneity assumption (e.g. Extended Kalman Filter)

Extend to other distributions

Try other (likelihood, conjugate prior) pairs

Approximate inference (MH-MCMC)


Page 18

References

S. Roweis, Z. Ghahramani. A Unifying Review of Linear Gaussian Models. Neural Computation, 11(2):305–345, 1999.

Image Attributions:

http://www.robots.ox.ac.uk/ parg/projects/ica/riz/Thesis/Figs/var/MoG.jpeg

https://github.com/echen/restricted-boltzmann-machines

http://upload.wikimedia.org/wikipedia/commons/1/15/GaussianScatterPCA.png

http://www.ee.columbia.edu/ln/LabROSA/doc/HTKBook21/img15.gif

http://commons.wikimedia.org/wiki/File:Basic concept of Kalman filtering.svg

http://learning.cis.upenn.edu/cis520 fall2009/uploads/Lectures/pca-example-1D-of-2D.png
