Pattern Recognition and Machine Learning

Transcript of Pattern Recognition and Machine Learning

Page 1: Pattern Recognition  and  Machine Learning

Lars Kasper, December 15th 2010

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 12: CONTINUOUS LATENT VARIABLES

Page 2: Pattern Recognition  and  Machine Learning

Relation To Other Topics

• Last weeks: Approximate Inference
• Today: Back to
  • Data preprocessing
  • Data representation/Feature extraction
  • “Model-free” analysis
  • Dimensionality reduction
  • The matrix

• Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data

Page 3: Pattern Recognition  and  Machine Learning

Take-home TLAs (Three-letter acronyms)

Although termed “continuous latent variables”, we mainly deal with
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Factor analysis

General motivation/theme: “What is interesting about my data – but hidden (latent)? …And what is just noise?”

Page 4: Pattern Recognition  and  Machine Learning

Importance Sampling ;-)

Year  Publications  Share
1996     2     0.1918 %
1997     3     0.2876 %
1998     7     0.6711 %
1999    17     1.6299 %
2000    33     3.1640 %
2001    41     3.9310 %
2002    54     5.1774 %
2003    53     5.0815 %
2004    77     7.3826 %
2005    85     8.1496 %
2006    98     9.3960 %
2007   115    11.0259 %
2008   139    13.3269 %
2009   160    15.3404 %
2010   157    15.0527 %

Publications concerning fMRI and (PCA or ICA or factor analysis). Source: ISI Web of Knowledge, Dec 13th, 2010

Page 5: Pattern Recognition  and  Machine Learning

Importance Sampling: fMRI

MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation

• Used for fMRI analysis, e.g. software package FSL: “MELODIC”

Page 6: Pattern Recognition  and  Machine Learning

Motivation: Low intrinsic dimensionality

• Generating hand-written digit samples by translating and rotating one example 100 times

• High-dimensional data (100 x 100 pixels)
• Few degrees of freedom (1 rotation angle, 2 translations)
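
As a minimal illustration (not from the slides), such a data set can be generated with numpy and scipy.ndimage; the template image and the parameter ranges below are invented placeholders.

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)

def make_samples(template, n_samples=100, max_angle=30, max_shift=5):
    """Generate high-dimensional samples with only 3 degrees of freedom:
    one rotation angle and two translations of a single template image."""
    samples = []
    for _ in range(n_samples):
        angle = rng.uniform(-max_angle, max_angle)            # 1 dof
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)   # 2 dof
        img = rotate(template, angle, reshape=False)
        img = shift(img, (dy, dx))
        samples.append(img.ravel())                           # 100*100 = 10000-dim vector
    return np.stack(samples)

# template: a single 100x100 hand-written digit (crude placeholder here)
template = np.zeros((100, 100))
template[30:70, 45:55] = 1.0
X = make_samples(template)   # shape (100, 10000): high-dimensional data, low intrinsic dimension
```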

Page 7: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 8: Pattern Recognition  and  Machine Learning

Heuristic PCA: Projection View

How do we simplify or compress our data (make it low-dimensional) without losing essential information? Dimensionality reduction by projecting onto a linear subspace

2D-data

Projected onto a 1D line

Page 9: Pattern Recognition  and  Machine Learning

Heuristic PCA: Dimensionality Reduction

High-dimensional data (dimension D)
• Data points x_n

Projection onto a low-dimensional subspace
• Dimension M < D
• Projected data points \tilde{x}_n

Advantages:
• Reduced amount of data
• Might make it easier to reveal structure within the data (pattern recognition, data visualization)

Page 10: Pattern Recognition  and  Machine Learning

Heuristic PCA: Maximum Variance View

• We want to reduce the dimensionality of our data space via a linear projection.

• But we still want to keep the projected samples as different as possible.

• A good measure for this difference is the data covariance, expressed by the matrix

S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T,  where \bar{x} is the mean of all data points and N the number of data points.

• Note: This expresses the covariance between different data dimensions, not between data points.

• We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors u_1, \dots, u_M.

Page 11: Pattern Recognition  and  Machine Learning

Maximum Variance View: The Maths

• Maximum variance formulation of a 1D projection with projection vector u_1: maximize the projected variance u_1^T S u_1

• Constrained optimization with u_1^T u_1 = 1 (Lagrange multiplier \lambda_1):

u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)

• Leads to the best projector u_1 being an eigenvector of S, the data covariance matrix:

S u_1 = \lambda_1 u_1

• with maximum projected variance equal to the maximum eigenvalue:

u_1^T S u_1 = \lambda_1
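
A minimal numpy sketch of this result: form the covariance matrix S and keep the eigenvectors with the largest eigenvalues (function and variable names are mine, not from the slides).

```python
import numpy as np

def pca(X, M):
    """Standard PCA: return the M principal components (eigenvectors of the
    data covariance matrix S with the largest eigenvalues) and the projected data."""
    x_bar = X.mean(axis=0)                 # mean of all data points
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending eigenvalues for symmetric S
    order = np.argsort(eigvals)[::-1][:M]
    U = eigvecs[:, order]                  # D x M matrix of principal components u_1..u_M
    Z = Xc @ U                             # projected coordinates; variance of column i = lambda_i
    return U, eigvals[order], Z
```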

Page 12: Pattern Recognition  and  Machine Learning

Heuristic PCA: Conclusion

By induction we obtain the general PCA result for maximizing the variance of the data in the projected dimensions:

The projection vectors shall be the M eigenvectors u_1, \dots, u_M corresponding to the M largest eigenvalues \lambda_1, \dots, \lambda_M of the data covariance matrix S. These vectors are called the principal components.

Page 13: Pattern Recognition  and  Machine Learning

Heuristic PCA: Minimum error formulation

• By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible.

• Therefore we minimize the mean quadratic error

J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2

• with respect to the projection vectors u_i.
• This leads to the same result as in the maximum variance formulation: the u_i shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix S.
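
To illustrate the equivalence numerically, one can reconstruct the data from the top M components and evaluate the quadratic error; this sketch assumes the pca() helper from the maximum-variance sketch above.

```python
import numpy as np

def reconstruction_error(X, M):
    """Mean quadratic error J = (1/N) sum_n ||x_n - x_tilde_n||^2 after projecting
    onto the M principal components; equals the sum of the discarded eigenvalues."""
    U, lam, Z = pca(X, M)             # pca() as sketched above
    x_bar = X.mean(axis=0)
    X_tilde = x_bar + Z @ U.T         # project back into data space
    J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
    return J
```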

Page 14: Pattern Recognition  and  Machine Learning

Example: Eigenimages

Page 15: Pattern Recognition  and  Machine Learning

Eigenimages II

Christopher DeCoro http://www.cs.princeton.edu/cdecoro/eigenfaces/

Page 16: Pattern Recognition  and  Machine Learning

Dimensionality Reduction

Page 17: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 18: Pattern Recognition  and  Machine Learning

Probabilistic PCA: A synthesizer’s view

x = W z + \mu + \epsilon

• p(z) = N(z | 0, I) – a standard normal distribution
  • Independent latent variables with zero mean and unit variance
• p(\epsilon) = N(\epsilon | 0, \sigma^2 I) – a spherical Gaussian
  • i.e. identical, independent noise in each of the data dimensions
• Prior predictive or marginal distribution of the data points:

p(x) = N(x | \mu, W W^T + \sigma^2 I)
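
A small numpy sketch of this generative view, with illustrative values for D, M and \sigma^2:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 5, 2, 1000                  # data dimension, latent dimension, sample count

W = rng.normal(size=(D, M))           # projection matrix (model parameter)
mu = rng.normal(size=D)               # data mean
sigma2 = 0.1                          # isotropic noise variance

Z = rng.normal(size=(N, M))                               # p(z)       = N(z | 0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))      # p(epsilon) = N(eps | 0, sigma^2 I)
X = Z @ W.T + mu + eps                                    # x = W z + mu + eps

# Marginal of x: p(x) = N(x | mu, C) with C = W W^T + sigma^2 I
C = W @ W.T + sigma2 * np.eye(D)
# For large N, the sample covariance np.cov(X.T, bias=True) approaches C
```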

Page 19: Pattern Recognition  and  Machine Learning

Probabilistic PCA: ML-solution

W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R

• Same as in heuristic PCA: U_M is the matrix of the first M eigenvectors of the data covariance matrix S, L_M the diagonal matrix of the corresponding eigenvalues
• Only specified up to a rotation R in latent space
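
A sketch of this closed-form ML solution, choosing R = I for the arbitrary rotation:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form ML solution of probabilistic PCA:
    sigma2 = average of the D - M discarded eigenvalues,
    W_ML   = U_M (L_M - sigma2 I)^{1/2} R  with R = I chosen here
    (the solution is only specified up to a rotation R in latent space)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / N
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    sigma2 = eigvals[M:].mean()                          # noise = mean of discarded eigenvalues
    U_M = eigvecs[:, :M]                                 # first M eigenvectors of S
    L_M = np.diag(eigvals[:M])                           # diagonal matrix of eigenvalues
    W_ml = U_M @ np.sqrt(L_M - sigma2 * np.eye(M))
    return W_ml, sigma2
```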

Page 20: Pattern Recognition  and  Machine Learning

Recap: The EM-algorithm

• The Expectation-Maximization algorithm determines the Maximum-Likelihood solution for our model parameters iteratively

• Advantageous compared to direct eigenvector decomposition if M \ll D, i.e. if we have considerably fewer latent variables than data dimensions
• Projection onto a very low-dimensional space, e.g. down to two or three dimensions for data visualization

Page 21: Pattern Recognition  and  Machine Learning

EM-Algorithm: Expectation Step

• We consider the complete-data likelihood p(X, Z | \mu, W, \sigma^2)

• Maximizing the marginal likelihood p(X | \mu, W, \sigma^2) instead would require an integration over the latent space

• E-Step: The posterior distribution of the latent variables p(Z | X, \mu, W, \sigma^2) is updated and used to calculate the expected value of the complete-data log-likelihood with respect to it

• keeping the estimates of W and \sigma^2 fixed

Page 22: Pattern Recognition  and  Machine Learning

EM-Algorithm: Maximization Step

• M-Step: The calculated expectation is now maximized with respect to W and \sigma^2:

• keeping the estimated posterior distribution of the latent variables Z fixed from the E-Step
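
Putting both steps together, a compact numpy sketch of EM for probabilistic PCA could look as follows (a sketch under the stated update scheme, not a reference implementation; variable names are mine):

```python
import numpy as np

def ppca_em(X, M, n_iter=50, seed=0):
    """EM for probabilistic PCA. Avoids the D x D eigendecomposition,
    which pays off when M << D. Returns estimates of W and sigma^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    W = rng.normal(size=(D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables z_n (W, sigma2 kept fixed)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                      # E[z_n], shape (N, M)
        Ezz = N * sigma2 * Minv + Ez.T @ Ez     # sum_n E[z_n z_n^T]
        # M-step: maximize the expected complete-data log-likelihood w.r.t. W and sigma2
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2
```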

Page 23: Pattern Recognition  and  Machine Learning

EM-algorithm for ML-PCA

Green dots: data points, always fixed.
E-Step: the red rod is fixed; the cyan connections of the blue springs move, obeying the spring forces.
M-Step: the cyan connections are fixed; the red rod moves, obeying the spring forces.


Page 24: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 25: Pattern Recognition  and  Machine Learning

Bayesian PCA – Finding the real dimension

Maximum Likelihood vs. Bayesian PCA

x = W z + \mu + \epsilon

• Bayesian PCA: introducing hyperparameters \alpha_i for the columns of W and marginalizing over W
• Estimating the hyperparameters from the data determines how many columns of W are actually needed

Figure: estimated projection matrices W for a latent variable model with maximal latent dimension, fitted to synthetic data generated from a latent model of lower intrinsic dimensionality.
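
A rough sketch of how such hyperparameter updates can be interleaved with the EM iterations, assuming the evidence-based re-estimation \alpha_i = D / \|w_i\|^2; iteration counts and thresholds below are illustrative:

```python
import numpy as np

def bayesian_pca(X, n_iter=200, seed=0):
    """Bayesian PCA sketch: a Gaussian prior N(w_i | 0, alpha_i^{-1} I) per column of W
    and re-estimation alpha_i = D / ||w_i||^2 drive unneeded columns of W towards zero,
    so the effective latent dimensionality is determined automatically."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    M = D - 1                                    # start with the maximal latent dimension
    Xc = X - X.mean(axis=0)
    W = rng.normal(size=(D, M))
    sigma2, alpha = 1.0, np.ones(M)
    for _ in range(n_iter):
        # E-step as in ML probabilistic PCA
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv
        Ezz = N * sigma2 * Minv + Ez.T @ Ez
        # M-step with the prior: only change is the sigma2 * diag(alpha) regularizer
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz + sigma2 * np.diag(alpha))
        sigma2 = (np.sum(Xc ** 2) - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
        alpha = D / np.maximum(np.sum(W ** 2, axis=0), 1e-12)   # hyperparameter update per column
    # columns with non-negligible norm define the effective latent dimension (threshold illustrative)
    effective_dim = int(np.sum(np.sum(W ** 2, axis=0) > 1e-6))
    return W, effective_dim
```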

Page 26: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 27: Pattern Recognition  and  Machine Learning

Factor Analysis: A non-spherical PCA

x = W z + \mu + \epsilon  with  p(\epsilon) = N(\epsilon | 0, \Psi),  \Psi diagonal

• Noise is still independent and Gaussian, but its amplitude may differ between data dimensions
• Controversy: Do the factors (dimensions of z) have an interpretable meaning?
• Problem: the posterior is invariant w.r.t. rotations of W
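
For a quick experiment, scikit-learn's FactorAnalysis can be fitted to synthetic data with deliberately unequal per-dimension noise (all numbers below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
N, D, M = 500, 6, 2
Z = rng.normal(size=(N, M))                  # latent factors
W = rng.normal(size=(D, M))                  # loading matrix
# per-dimension noise amplitudes (diagonal Psi) -- exactly what factor analysis relaxes
psi = np.array([0.1, 0.1, 0.5, 0.5, 2.0, 2.0])
X = Z @ W.T + rng.normal(size=(N, D)) * np.sqrt(psi)

fa = FactorAnalysis(n_components=M).fit(X)
print(fa.noise_variance_)   # estimated diagonal of Psi, one variance per data dimension
```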

Page 28: Pattern Recognition  and  Machine Learning

Independent Component Analysis (ICA)

x = W z

• Still a linear model of independent components
• No data noise component; dim(latent space) = dim(data space)
• Explicitly non-Gaussian latent distributions
  • Otherwise (rotational symmetry of the Gaussian) no separation of the mixing coefficients in W from the latent variables z would be possible
• Maximization of non-Gaussianity/independence
  • Different criteria, e.g. kurtosis, skewness
  • Minimization of mutual information
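
A small sketch of the classic blind source separation toy problem with scikit-learn's FastICA; the sources and the mixing matrix are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))                  # square wave
s2 = (t % 1.0) - 0.5                         # sawtooth
S = np.c_[s1, s2] + 0.02 * rng.normal(size=(2000, 2))   # independent non-Gaussian sources

A = np.array([[1.0, 0.5], [0.4, 1.0]])       # mixing matrix (the W of x = W z)
X = S @ A.T                                  # observed mixtures, no extra data noise

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                 # recovered sources (up to permutation and scaling)
```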

Page 29: Pattern Recognition  and  Machine Learning

ICA vs PCA

• ICA rewards bi-modality of the projected distribution
• PCA rewards maximum variance between elements

PCA 1st principal component

ICA 1st independent component

Unsupervised method:No class labels!

Page 30: Pattern Recognition  and  Machine Learning

Summary

Parameter estimation:
• Heuristic quadratic cost function (minimum error projection)
• Probabilistic (maximum likelihood projection matrix)
• Bayesian (hyperparameters of projection vectors)

Generative probabilistic process in latent space:
• Standardized normal distribution (PCA)
• Standardized normal distribution (Factor Analysis)
• Independent probabilistic process for each dimension (ICA)

Noise in data space:
• Spherical Gaussian (PCA)
• Gaussian (Factor Analysis)
• None (ICA)

Feature mapping (latent to data space):
• Linear: PCA, ICA, Factor Analysis
• Nonlinear: Kernel PCA

Page 31: Pattern Recognition  and  Machine Learning

Relation To Other Topics

• Today
  • Data preprocessing
    • Whitening via covariance => identity (see the sketch below)
  • Data representation/Feature extraction
  • “Model-free” analysis
    • Well: NO! We have seen the model assumptions in probabilistic PCA
  • Dimensionality reduction
    • Via projection onto the basis vectors carrying the most variance/leaving the smallest error
    • At least for linear models, not for kernel PCA
  • The matrix
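
A minimal sketch of the whitening step mentioned above (PCA whitening; the eps term is an assumption added for numerical stability):

```python
import numpy as np

def whiten(X, eps=1e-9):
    """PCA whitening: rotate onto the eigenvectors of the covariance matrix and
    rescale each direction by 1/sqrt(eigenvalue), so the covariance becomes the identity."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, U = np.linalg.eigh(S)
    Y = Xc @ U / np.sqrt(eigvals + eps)   # whitened data
    return Y

# np.cov(whiten(X).T, bias=True) is (approximately) the identity matrix
```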

Page 32: Pattern Recognition  and  Machine Learning

Kernel PCA

Data space:     S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T
Feature space:  C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T

• Instead of the sample covariance matrix, we now consider a covariance matrix in a feature space

• As always, the kernel trick of not computing in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the feature vectors, k(x_n, x_m) = \phi(x_n)^T \phi(x_m)
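
A hand-rolled sketch of kernel PCA with a Gaussian kernel, showing that only the N x N Gram matrix of scalar products is needed; the kernel width gamma is an illustrative choice:

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA with a Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2).
    The feature-space covariance C is never formed explicitly; only the N x N
    Gram matrix of scalar products k(x_n, x_m) = phi(x_n)^T phi(x_m) is needed."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gram matrix
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n               # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:M]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas   # projections of the training points onto the M kernel principal components
```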

Page 33: Pattern Recognition  and  Machine Learning

Kernel PCA – Example: Gaussian kernel

• Kernel PCA does not enable dimensionality reduction via reconstruction in data space
• The image of the data space under \phi is a manifold in feature space, not a linear subspace
• The PCA projects onto linear subspaces of feature space
• The projected elements typically do not lie on this manifold, so their pre-images will not exist in data space

space