Pattern Recognition and Machine Learning

Transcript of Pattern Recognition and Machine Learning

Page 1: Pattern Recognition  and  Machine Learning

Lars Kasper, December 15th 2010

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 12: CONTINUOUS LATENT VARIABLES

Page 2: Pattern Recognition  and  Machine Learning

Relation To Other Topics

• Last weeks: Approximate Inference
• Today: Back to
  • Data preprocessing
  • Data representation/Feature extraction
  • “Model-free” analysis
  • Dimensionality reduction
  • The matrix

• Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data

Page 3: Pattern Recognition  and  Machine Learning

Take-home TLAs (Three-letter acronyms)

Although termed “continuous latent variables”, we mainly deal with
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Factor analysis

General motivation/theme: “What is interesting about my data – but hidden (latent)? …And what is just noise?”

Page 4: Pattern Recognition  and  Machine Learning

Importance Sampling ;-)

Year  Publications  Share
1996     2     0.1918 %
1997     3     0.2876 %
1998     7     0.6711 %
1999    17     1.6299 %
2000    33     3.1640 %
2001    41     3.9310 %
2002    54     5.1774 %
2003    53     5.0815 %
2004    77     7.3826 %
2005    85     8.1496 %
2006    98     9.3960 %
2007   115    11.0259 %
2008   139    13.3269 %
2009   160    15.3404 %
2010   157    15.0527 %

Publications concerning fMRI and (PCA or ICA or factor analysis). Source: ISI Web of Knowledge, Dec 13th, 2010

Page 5: Pattern Recognition  and  Machine Learning

Importance Sampling: fMRI

MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation

• Used for fMRI analysis, e.g. software package FSL: “MELODIC”

Page 6: Pattern Recognition  and  Machine Learning

Motivation: Low intrinsic dimensionality

• Generating hand-written digit samples by translating and rotating one example 100 times

• High-dimensional data (100 x 100 pixels)
• Few degrees of freedom (1 rotation angle, 2 translations)
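
As a minimal illustration (not from the slides), such a data set can be generated with numpy and scipy.ndimage; the template image and the parameter ranges below are invented placeholders.

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)

def make_samples(template, n_samples=100, max_angle=30, max_shift=5):
    """Generate high-dimensional samples with only 3 degrees of freedom:
    one rotation angle and two translations of a single template image."""
    samples = []
    for _ in range(n_samples):
        angle = rng.uniform(-max_angle, max_angle)            # 1 dof
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)   # 2 dof
        img = rotate(template, angle, reshape=False)
        img = shift(img, (dy, dx))
        samples.append(img.ravel())                           # 100*100 = 10000-dim vector
    return np.stack(samples)

# template: a single 100x100 hand-written digit (crude placeholder here)
template = np.zeros((100, 100))
template[30:70, 45:55] = 1.0
X = make_samples(template)   # shape (100, 10000): high-dimensional data, low intrinsic dimension
```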

Page 7: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 8: Pattern Recognition  and  Machine Learning

Heuristic PCA: Projection View

How do we simplify or compress our data (make it low-dimensional) without losing essential information? Dimensionality reduction by projecting onto a linear subspace

2D-data

Projected onto a 1D line

Page 9: Pattern Recognition  and  Machine Learning

Heuristic PCA: Dimensionality Reduction

High-dimensional data (dimension D)
• Data points x_n

Projection onto a low-dimensional subspace
• Dimension M < D
• Projected data points \tilde{x}_n

Advantages:
• Reduced amount of data
• Might make it easier to reveal structure within the data (pattern recognition, data visualization)

Page 10: Pattern Recognition  and  Machine Learning

Heuristic PCA: Maximum Variance View

• We want to reduce the dimensionality of our data space via a linear projection.

• But we still want to keep the projected samples as different as possible.

• A good measure for this difference is the data covariance, expressed by the matrix

S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T,  where \bar{x} is the mean of all data points and N the number of data points.

• Note: This expresses the covariance between different data dimensions, not between data points.

• We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors u_1, \dots, u_M.

Page 11: Pattern Recognition  and  Machine Learning

Maximum Variance View: The Maths

• Maximum variance formulation of a 1D projection with projection vector u_1: maximize the projected variance u_1^T S u_1

• Constrained optimization with u_1^T u_1 = 1 (Lagrange multiplier \lambda_1):

u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)

• Leads to the best projector u_1 being an eigenvector of S, the data covariance matrix:

S u_1 = \lambda_1 u_1

• with maximum projected variance equal to the maximum eigenvalue:

u_1^T S u_1 = \lambda_1
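
A minimal numpy sketch of this result: form the covariance matrix S and keep the eigenvectors with the largest eigenvalues (function and variable names are mine, not from the slides).

```python
import numpy as np

def pca(X, M):
    """Standard PCA: return the M principal components (eigenvectors of the
    data covariance matrix S with the largest eigenvalues) and the projected data."""
    x_bar = X.mean(axis=0)                 # mean of all data points
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending eigenvalues for symmetric S
    order = np.argsort(eigvals)[::-1][:M]
    U = eigvecs[:, order]                  # D x M matrix of principal components u_1..u_M
    Z = Xc @ U                             # projected coordinates; variance of column i = lambda_i
    return U, eigvals[order], Z
```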

Page 12: Pattern Recognition  and  Machine Learning

Heuristic PCA: Conclusion

By induction we obtain the general PCA result for maximizing the variance of the data in the projected dimensions:

The projection vectors shall be the M eigenvectors u_1, \dots, u_M corresponding to the M largest eigenvalues \lambda_1, \dots, \lambda_M of the data covariance matrix S. These vectors are called the principal components.

Page 13: Pattern Recognition  and  Machine Learning

Heuristic PCA: Minimum error formulation

• By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible.

• Therefore we minimize the mean quadratic error

J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2

• with respect to the projection vectors u_i.
• This leads to the same result as in the maximum variance formulation: the u_i shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix S.
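
To illustrate the equivalence numerically, one can reconstruct the data from the top M components and evaluate the quadratic error; this sketch assumes the pca() helper from the maximum-variance sketch above.

```python
import numpy as np

def reconstruction_error(X, M):
    """Mean quadratic error J = (1/N) sum_n ||x_n - x_tilde_n||^2 after projecting
    onto the M principal components; equals the sum of the discarded eigenvalues."""
    U, lam, Z = pca(X, M)             # pca() as sketched above
    x_bar = X.mean(axis=0)
    X_tilde = x_bar + Z @ U.T         # project back into data space
    J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
    return J
```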

Page 14: Pattern Recognition  and  Machine Learning

Example: Eigenimages

Page 15: Pattern Recognition  and  Machine Learning

Eigenimages II

Christopher DeCoro http://www.cs.princeton.edu/cdecoro/eigenfaces/

Page 16: Pattern Recognition  and  Machine Learning

Dimensionality Reduction

Page 17: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 18: Pattern Recognition  and  Machine Learning

Probabilistic PCA: A synthesizer’s view

x = W z + \mu + \epsilon

• p(z) = N(z | 0, I) – a standard normal distribution
  • Independent latent variables with zero mean and unit variance
• p(\epsilon) = N(\epsilon | 0, \sigma^2 I) – a spherical Gaussian
  • i.e. identical, independent noise in each of the data dimensions
• Prior predictive or marginal distribution of the data points:

p(x) = N(x | \mu, W W^T + \sigma^2 I)
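
A small numpy sketch of this generative view, with illustrative values for D, M and \sigma^2:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 5, 2, 1000                  # data dimension, latent dimension, sample count

W = rng.normal(size=(D, M))           # projection matrix (model parameter)
mu = rng.normal(size=D)               # data mean
sigma2 = 0.1                          # isotropic noise variance

Z = rng.normal(size=(N, M))                               # p(z)       = N(z | 0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))      # p(epsilon) = N(eps | 0, sigma^2 I)
X = Z @ W.T + mu + eps                                    # x = W z + mu + eps

# Marginal of x: p(x) = N(x | mu, C) with C = W W^T + sigma^2 I
C = W @ W.T + sigma2 * np.eye(D)
# For large N, the sample covariance np.cov(X.T, bias=True) approaches C
```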

Page 19: Pattern Recognition  and  Machine Learning

Probabilistic PCA: ML-solution

W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R

• Same as in heuristic PCA: U_M is the matrix of the first M eigenvectors of the data covariance matrix S, L_M the diagonal matrix of the corresponding eigenvalues
• Only specified up to a rotation R in latent space
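
A sketch of this closed-form ML solution, choosing R = I for the arbitrary rotation:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form ML solution of probabilistic PCA:
    sigma2 = average of the D - M discarded eigenvalues,
    W_ML   = U_M (L_M - sigma2 I)^{1/2} R  with R = I chosen here
    (the solution is only specified up to a rotation R in latent space)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / N
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    sigma2 = eigvals[M:].mean()                          # noise = mean of discarded eigenvalues
    U_M = eigvecs[:, :M]                                 # first M eigenvectors of S
    L_M = np.diag(eigvals[:M])                           # diagonal matrix of eigenvalues
    W_ml = U_M @ np.sqrt(L_M - sigma2 * np.eye(M))
    return W_ml, sigma2
```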

Page 20: Pattern Recognition  and  Machine Learning

Recap: The EM-algorithm

• The Expectation-Maximization algorithm determines the Maximum-Likelihood solution for our model parameters iteratively

• Advantageous compared to direct eigenvector decomposition if M \ll D, i.e. if we have considerably fewer latent variables than data dimensions
• Projection onto a very low-dimensional space, e.g. down to two or three dimensions for data visualization

Page 21: Pattern Recognition  and  Machine Learning

EM-Algorithm: Expectation Step

• We consider the complete-data likelihood p(X, Z | \mu, W, \sigma^2)

• Maximizing the marginal likelihood p(X | \mu, W, \sigma^2) instead would require an integration over the latent space

• E-Step: The posterior distribution of the latent variables p(Z | X, \mu, W, \sigma^2) is updated and used to calculate the expected value of the complete-data log-likelihood with respect to it

• keeping the estimates of W and \sigma^2 fixed

Page 22: Pattern Recognition  and  Machine Learning

EM-Algorithm: Maximization Step

• M-Step: The calculated expectation is now maximized with respect to W and \sigma^2:

• keeping the estimated posterior distribution of the latent variables Z fixed from the E-Step
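
Putting both steps together, a compact numpy sketch of EM for probabilistic PCA could look as follows (a sketch under the stated update scheme, not a reference implementation; variable names are mine):

```python
import numpy as np

def ppca_em(X, M, n_iter=50, seed=0):
    """EM for probabilistic PCA. Avoids the D x D eigendecomposition,
    which pays off when M << D. Returns estimates of W and sigma^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    W = rng.normal(size=(D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables z_n (W, sigma2 kept fixed)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                      # E[z_n], shape (N, M)
        Ezz = N * sigma2 * Minv + Ez.T @ Ez     # sum_n E[z_n z_n^T]
        # M-step: maximize the expected complete-data log-likelihood w.r.t. W and sigma2
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2
```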

Page 23: Pattern Recognition  and  Machine Learning

EM-algorithm for ML-PCA

Green dots: data points, always fixed.
E-Step: the red rod is fixed; the cyan connections of the blue springs move, obeying the spring forces.
M-Step: the cyan connections are fixed; the red rod moves, obeying the spring forces.


Page 24: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 25: Pattern Recognition  and  Machine Learning

Bayesian PCA – Finding the real dimension

Maximum Likelihood vs. Bayesian PCA

x = W z + \mu + \epsilon

• Bayesian PCA: introducing hyperparameters \alpha_i for the columns of W and marginalizing over W
• Estimating the hyperparameters from the data determines how many columns of W are actually needed

Figure: estimated projection matrices W for a latent variable model with maximal latent dimension, fitted to synthetic data generated from a latent model of lower intrinsic dimensionality.
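
A rough sketch of how such hyperparameter updates can be interleaved with the EM iterations, assuming the evidence-based re-estimation \alpha_i = D / \|w_i\|^2; iteration counts and thresholds below are illustrative:

```python
import numpy as np

def bayesian_pca(X, n_iter=200, seed=0):
    """Bayesian PCA sketch: a Gaussian prior N(w_i | 0, alpha_i^{-1} I) per column of W
    and re-estimation alpha_i = D / ||w_i||^2 drive unneeded columns of W towards zero,
    so the effective latent dimensionality is determined automatically."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    M = D - 1                                    # start with the maximal latent dimension
    Xc = X - X.mean(axis=0)
    W = rng.normal(size=(D, M))
    sigma2, alpha = 1.0, np.ones(M)
    for _ in range(n_iter):
        # E-step as in ML probabilistic PCA
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv
        Ezz = N * sigma2 * Minv + Ez.T @ Ez
        # M-step with the prior: only change is the sigma2 * diag(alpha) regularizer
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz + sigma2 * np.diag(alpha))
        sigma2 = (np.sum(Xc ** 2) - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
        alpha = D / np.maximum(np.sum(W ** 2, axis=0), 1e-12)   # hyperparameter update per column
    # columns with non-negligible norm define the effective latent dimension (threshold illustrative)
    effective_dim = int(np.sum(np.sum(W ** 2, axis=0) > 1e-6))
    return W, effective_dim
```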

Page 26: Pattern Recognition  and  Machine Learning

Roadmap for today

Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing Linearity: Kernel PCA

Page 27: Pattern Recognition  and  Machine Learning

Factor Analysis: A non-spherical PCA

x = W z + \mu + \epsilon  with  p(\epsilon) = N(\epsilon | 0, \Psi),  \Psi diagonal

• Noise is still independent and Gaussian, but its amplitude may differ between data dimensions
• Controversy: Do the factors (dimensions of z) have an interpretable meaning?
• Problem: the posterior is invariant w.r.t. rotations of W
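
For a quick experiment, scikit-learn's FactorAnalysis can be fitted to synthetic data with deliberately unequal per-dimension noise (all numbers below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
N, D, M = 500, 6, 2
Z = rng.normal(size=(N, M))                  # latent factors
W = rng.normal(size=(D, M))                  # loading matrix
# per-dimension noise amplitudes (diagonal Psi) -- exactly what factor analysis relaxes
psi = np.array([0.1, 0.1, 0.5, 0.5, 2.0, 2.0])
X = Z @ W.T + rng.normal(size=(N, D)) * np.sqrt(psi)

fa = FactorAnalysis(n_components=M).fit(X)
print(fa.noise_variance_)   # estimated diagonal of Psi, one variance per data dimension
```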

Page 28: Pattern Recognition  and  Machine Learning

Independent Component Analysis (ICA)

x = W z

• Still a linear model of independent components
• No data noise component; dim(latent space) = dim(data space)
• Explicitly non-Gaussian latent distributions
  • Otherwise (rotational symmetry of the Gaussian) no separation of the mixing coefficients in W from the latent variables z would be possible
• Maximization of non-Gaussianity/independence
  • Different criteria, e.g. kurtosis, skewness
  • Minimization of mutual information
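
A small sketch of the classic blind source separation toy problem with scikit-learn's FastICA; the sources and the mixing matrix are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))                  # square wave
s2 = (t % 1.0) - 0.5                         # sawtooth
S = np.c_[s1, s2] + 0.02 * rng.normal(size=(2000, 2))   # independent non-Gaussian sources

A = np.array([[1.0, 0.5], [0.4, 1.0]])       # mixing matrix (the W of x = W z)
X = S @ A.T                                  # observed mixtures, no extra data noise

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                 # recovered sources (up to permutation and scaling)
```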

Page 29: Pattern Recognition  and  Machine Learning

ICA vs PCA

• ICA rewards bi-modality of the projected distribution
• PCA rewards maximum variance between elements

PCA 1st principal component

ICA 1st independent component

Unsupervised method:No class labels!

Page 30: Pattern Recognition  and  Machine Learning

Summary

Parameter estimation:
• Heuristic quadratic cost function (minimum error projection)
• Probabilistic (maximum likelihood projection matrix)
• Bayesian (hyperparameters of projection vectors)

Generative probabilistic process in latent space:
• Standardized normal distribution (PCA)
• Standardized normal distribution (Factor Analysis)
• Independent probabilistic process for each dimension (ICA)

Noise in data space:
• Spherical Gaussian (PCA)
• Gaussian (Factor Analysis)
• None (ICA)

Feature mapping (latent to data space):
• Linear: PCA, ICA, Factor Analysis
• Nonlinear: Kernel PCA

Page 31: Pattern Recognition  and  Machine Learning

Relation To Other Topics

• Today
  • Data preprocessing
    • Whitening via covariance => identity (see the sketch below)
  • Data representation/Feature extraction
  • “Model-free” analysis
    • Well: NO! We have seen the model assumptions in probabilistic PCA
  • Dimensionality reduction
    • Via projection onto the basis vectors carrying the most variance/leaving the smallest error
    • At least for linear models, not for kernel PCA
  • The matrix
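
A minimal sketch of the whitening step mentioned above (PCA whitening; the eps term is an assumption added for numerical stability):

```python
import numpy as np

def whiten(X, eps=1e-9):
    """PCA whitening: rotate onto the eigenvectors of the covariance matrix and
    rescale each direction by 1/sqrt(eigenvalue), so the covariance becomes the identity."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, U = np.linalg.eigh(S)
    Y = Xc @ U / np.sqrt(eigvals + eps)   # whitened data
    return Y

# np.cov(whiten(X).T, bias=True) is (approximately) the identity matrix
```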

Page 32: Pattern Recognition  and  Machine Learning

Kernel PCA

Data space:     S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T
Feature space:  C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T

• Instead of the sample covariance matrix, we now consider a covariance matrix in a feature space

• As always, the kernel trick of not computing in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the feature vectors, k(x_n, x_m) = \phi(x_n)^T \phi(x_m)
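
A hand-rolled sketch of kernel PCA with a Gaussian kernel, showing that only the N x N Gram matrix of scalar products is needed; the kernel width gamma is an illustrative choice:

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA with a Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2).
    The feature-space covariance C is never formed explicitly; only the N x N
    Gram matrix of scalar products k(x_n, x_m) = phi(x_n)^T phi(x_m) is needed."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gram matrix
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n               # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:M]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas   # projections of the training points onto the M kernel principal components
```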

Page 33: Pattern Recognition  and  Machine Learning

Kernel PCA – Example: Gaussian kernel

• Kernel PCA does not enable dimensionality reduction via reconstruction in data space
• The image of the data space under \phi is a manifold in feature space, not a linear subspace
• The PCA projects onto linear subspaces of feature space
• The projected elements typically do not lie on this manifold, so their pre-images will not exist in data space

space