Introduction to Kernel Principal Component Analysis (PCA)
Mohammed Nasser, Dept. of Statistics, RU, Bangladesh
Email: [email protected]
Contents
Basics of PCA
Application of PCA in Face Recognition
Some Terms in PCA
Motivation for KPCA
Basics of KPCA
Applications of KPCA
High-dimensional Data
Examples: gene expression data, face images, handwritten digits
Why Feature Reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
– Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
Why Reduce Dimensionality?
1. Reduces time complexity: Less computation
2. Reduces space complexity: Fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc) if plotted in 2 or 3 dimensions
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
• Supervised
– Linear Discriminant Analysis (LDA)
• Semi-supervised
– Research topic
Algebraic derivation of PCs
• Main steps for computing PCs
– Form the covariance matrix S.
– Compute its eigenvectors: u_1, u_2, ..., u_p.
– Use the first d eigenvectors to form the d PCs.
– The transformation G is given by G = [u_1, u_2, ..., u_d].
– A test point x ∈ R^p is projected to the d-dimensional point G^T x (see the sketch below).
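A minimal NumPy sketch of these steps (the name pca_transform and the rows-are-observations layout are illustrative assumptions, not from the slides):

import numpy as np

def pca_transform(X, d):
    """X: n x p data matrix (rows are observations). Returns the n x d scores and G."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # p x p covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues/eigenvectors (ascending order)
    order = np.argsort(eigvals)[::-1]     # re-sort in decreasing order of eigenvalue
    G = eigvecs[:, order[:d]]             # first d eigenvectors form G (p x d)
    return Xc @ G, G                      # each row is G^T x for one (centered) point x

# Example: project 5-dimensional data onto its first 2 principal components.
X = np.random.default_rng(0).normal(size=(100, 5))
Y, G = pca_transform(X, d=2)              # Y has shape (100, 2)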
Optimality property of PCA
Original data: X ∈ R^{p×n}
Dimension reduction: Y = G^T X ∈ R^{d×n}
Reconstruction: X̃ = G Y = G G^T X ∈ R^{p×n}
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following min problem:
min_{G ∈ R^{p×d}} ||X − G G^T X||_F^2  subject to  G^T G = I_d
where ||X − G G^T X||_F^2 is the reconstruction error.
PCA projection minimizes the reconstruction error among all linear projections of size d.
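A small numerical illustration of this optimality claim (an illustrative check only; it compares the PCA projection against one arbitrary orthonormal projection of the same rank):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                    # rows are observations here

S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
G = eigvecs[:, np.argsort(eigvals)[::-1][:2]]      # first 2 eigenvectors of S

Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))       # a random orthonormal 5 x 2 competitor

err_pca  = np.linalg.norm(Xc - Xc @ G @ G.T, 'fro') ** 2   # reconstruction error of PCA
err_rand = np.linalg.norm(Xc - Xc @ Q @ Q.T, 'fro') ** 2   # reconstruction error of Q
print(err_pca <= err_rand)                                  # True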
Dimensionality Reduction
• One approach to deal with high dimensional data is by reducing their dimensionality.
• Project high dimensional data onto a lower dimensional sub-space using linear or non-linear transformations.
Dimensionality Reduction
• Linear transformations are simple to compute and tractable.
• Classical linear approaches:
– Principal Component Analysis (PCA)
– Fisher Discriminant Analysis (FDA)
– Singular Value Decomposition (SVD)
– Factor Analysis (FA)
– Canonical Correlation Analysis (CCA)
Y_i = U^T X_i, where Y_i is k × 1, U^T is k × d, and X_i is d × 1 (k << d)
Principal Component Analysis (PCA)
• Each dimensionality reduction technique finds an appropriate transformation by satisfying certain criteria (e.g., information loss, data discrimination, etc.)
• The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variation present in the dataset.
Principal Component Analysis (PCA)
• Find a basis in a low dimensional sub-space:
– Approximate vectors by projecting them onto a low dimensional sub-space:
(1) Original space representation:
x = a_1 v_1 + a_2 v_2 + ... + a_N v_N
where v_1, v_2, ..., v_N is a basis in the original N-dimensional space.
(2) Lower-dimensional sub-space representation:
x̂ = b_1 u_1 + b_2 u_2 + ... + b_K u_K
where u_1, u_2, ..., u_K is a basis in the K-dimensional sub-space (K < N).
• Note: if K = N, then x̂ = x.
Principal Component Analysis (PCA)
• Example (K = N):
Principal Component Analysis (PCA)
• Methodology
– Suppose x1, x2, ..., xM are N x 1 vectors
Principal Component Analysis (PCA)
• Methodology – cont.
b_i = u_i^T (x − x̄)
Principal Component Analysis (PCA)
• Linear transformation implied by PCA
– The linear transformation R^N → R^K that performs the dimensionality reduction is: b = U^T (x − x̄), where U = [u_1 u_2 ... u_K].
Principal Component Analysis (PCA)
• How many principal components to keep?
– To choose K, a common criterion is to retain a given fraction of the total variance, e.g. pick the smallest K such that (λ_1 + ... + λ_K) / (λ_1 + ... + λ_N) > 0.9 (see the sketch below).
Unfortunately, for some data sets, meeting this requirement needs K almost equal to N. That is, no effective data reduction is possible.
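A hedged sketch of this variance-ratio rule (choose_k and the 0.9 threshold are illustrative, not prescribed by the slides):

import numpy as np

def choose_k(eigvals, threshold=0.9):
    """eigvals: eigenvalues sorted in decreasing order; returns the smallest such K."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Example: the first two eigenvalues already explain 95% of the total variance.
print(choose_k(np.array([6.0, 3.5, 0.3, 0.2])))   # -> 2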
Principal Component Analysis (PCA)
• Eigenvalue spectrum
[Figure: scree plot of the eigenvalues λ_i, from λ_1 down to λ_N, with the cutoff at K]
Principal Component Analysis (PCA)
• Standardization
– The principal components are dependent on the units used to measure the original variables as well as on the range of values they assume.
– We should always standardize the data prior to using PCA.
– A common standardization method is to transform all the data to have zero mean and unit standard deviation:
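A minimal sketch of this standardization (the helper name is illustrative; standardizing and then using the covariance matrix is equivalent to using the correlation matrix):

import numpy as np

def standardize(X):
    """X: n x p data matrix; returns (X - column mean) / column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])  # e.g. height (cm), weight (kg)
Z = standardize(X)   # every column of Z now has mean 0 and standard deviation 1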
CS 479/679 Pattern Recognition – Spring 2006
Dimensionality Reduction Using PCA/LDA – Chapter 3 (Duda et al.), Section 3.8
Case Studies:Face Recognition Using Dimensionality Reduction
M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image Retrieval", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), pp. 831-836, 1996.
A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
Principal Component Analysis (PCA)
• Face Recognition
– The simplest approach is to think of it as a template matching problem
– Problems arise when performing recognition in a high-dimensional space.
– Significant improvements can be achieved by first mapping the data into a lower dimensionality space.
– How to find this lower-dimensional space?
Principal Component Analysis (PCA)
• Main idea behind eigenfaces
[Figure: the average face]
Principal Component Analysis (PCA)
• Computation of the eigenfaces
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
(Note that the eigenvectors u_i are normalized.)
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
Principal Component Analysis (PCA)
• Representing faces onto this basis
Principal Component Analysis (PCA)
• Representing faces onto this basis – cont.
Principal Component Analysis (PCA)
• Face Recognition Using Eigenfaces
Principal Component Analysis (PCA)
• Face Recognition Using Eigenfaces – cont.
– The distance er is called distance within the face space (difs)
– Comment: we can use the common Euclidean distance to compute e_r; however, it has been reported that the Mahalanobis distance performs better (see the sketch below):
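A hedged sketch of the two distances (names are illustrative; b and b_k are assumed to be the K-dimensional eigenface coefficient vectors of the probe face and of a stored face, and lam the corresponding eigenvalues):

import numpy as np

def euclidean_difs(b, b_k):
    return np.sqrt(np.sum((b - b_k) ** 2))

def mahalanobis_difs(b, b_k, lam):
    # each squared coefficient difference is weighted by the inverse eigenvalue
    return np.sqrt(np.sum((b - b_k) ** 2 / lam))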
Principal Component Analysis (PCA)
• Face Detection Using Eigenfaces
Principal Component Analysis (PCA)
• Face Detection Using Eigenfaces – cont.
Principal Components Analysis
So, principal components are given by:
b1 = u11x1 + u12x2 + ... + u1NxN
b2 = u21x1 + u22x2 + ... + u2NxN
...
bN = uN1x1 + uN2x2 + ... + uNNxN
xj’s are standardized if correlation matrix is used (mean 0.0, SD 1.0)
Score of ith unit on jth principal component
bi,j = uj1xi1 + uj2xi2 + ... + ujNxiN
PCA Scores
[Figure: scatter plot of x_i1 versus x_i2 with the principal axes overlaid; the scores b_i,1 and b_i,2 are the coordinates of unit i along those axes]
Principal Components Analysis
Amount of variance accounted for by:
1st principal component, λ1, 1st eigenvalue
2nd principal component, λ2, 2nd eigenvalue
...
λ1 > λ2 > λ3 > λ4 > ...
Average λj = 1 (correlation matrix)
Principal Components Analysis: Eigenvalues
[Figure: the same scatter plot of x_i1 versus x_i2, with the first principal axis U1; λ1 and λ2 mark the variances along the first and second principal axes]
PCA: Terminology
• jth principal component is the jth eigenvector of the correlation/covariance matrix
• coefficients, ujk, are the elements of the eigenvectors and relate the original variables (standardized if using the correlation matrix) to the components
• scores are the values of the units on the components (produced using the coefficients)
• amount of variance accounted for by a component is given by its eigenvalue, λj
• proportion of variance accounted for by a component is given by λj / Σ λj
• loading of the kth original variable on the jth component is given by ujk √λj (the correlation between the variable and the component); see the sketch below
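A short NumPy sketch tying these terms together (the helper name is illustrative; it follows the correlation-matrix convention, so the variables are standardized first):

import numpy as np

def pca_terminology(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables
    lam, U = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]           # eigenvalues lambda_j and eigenvectors u_j
    scores = Z @ U                             # values of the units on the components
    proportion = lam / lam.sum()               # lambda_j / sum(lambda_j)
    loadings = U * np.sqrt(lam)                # u_jk * sqrt(lambda_j): variable-component correlations
    return scores, proportion, loadings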
Principal Components Analysis
• Covariance Matrix:
– Variables must be in same units
– Emphasizes variables with most variance
– Mean eigenvalue ≠1.0
– Useful in morphometrics, a few other cases
• Correlation Matrix:
– Variables are standardized (mean 0.0, SD 1.0)
– Variables can be in different units
– All variables have same impact on analysis
– Mean eigenvalue = 1.0
PCA: Potential Problems
• Lack of Independence– NO PROBLEM
• Lack of Normality– Normality desirable but not essential
• Lack of Precision– Precision desirable but not essential
• Many Zeroes in Data Matrix– Problem (use Correspondence Analysis)
Principal Component Analysis (PCA)
• PCA and classification (cont’d)
Motivation
[Figure: scatter plots of a nonlinearly structured data set in the (z, v) and (z, u) planes; question marks indicate that no linear direction reveals the structure]
Motivation
Linear projections will not detect the pattern.
Limitations of linear PCA
λ1 = λ2 = λ3 = 1/3
Nonlinear PCA
Three popular methods are available:
1) Neural-network based PCA (E. Oja, 1982)
2) Method of Principal Curves (T.J. Hastie and W. Stuetzle, 1989)
3) Kernel based PCA (B. Schölkopf, A. Smola, and K. Müller, 1998)
[Figure: comparison of PCA and nonlinear PCA (NPCA) fits to the same data]
Kernel PCA: The main idea
A Useful Theorem for Hilbert space
Let H be a Hilbert space and x1, ..., xn in H. Let V = span{x1, ..., xn}. Also let u and v be in V. Then
⟨xi, u⟩ = ⟨xi, v⟩, i = 1, ..., n, implies u = v.
Proof.
Try it yourself.
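The exercise admits a one-line argument; here is a brief proof sketch (an added note, not part of the original slide), written in LaTeX:

% u - v lies in V and is orthogonal to every generator of V, hence to itself.
\begin{proof}[Sketch]
Let $w = u - v \in V$. By assumption $\langle x_i, w\rangle = 0$ for $i = 1,\dots,n$.
Since $w \in V = \operatorname{span}\{x_1,\dots,x_n\}$, write $w = \sum_{i=1}^{n} c_i x_i$; then
$\|w\|^{2} = \langle w, w\rangle = \sum_{i=1}^{n} c_i \langle x_i, w\rangle = 0$,
so $w = 0$, i.e.\ $u = v$.
\end{proof}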
Kernel methods in PCA
Linear PCA: Cw = λw   (1)
where C is the covariance matrix of the centered data X:
C = (1/l) Σ_{i=1..l} x_i x_i'
Cw = (1/l) Σ_{i=1..l} ⟨x_i, w⟩ x_i  ⇒  w ∈ span{x_1, ..., x_l} if λ ≠ 0
λ⟨x_i, w⟩ = ⟨x_i, Cw⟩, i = 1, ..., l   (2)
(1) and (2) are equivalent conditions.
Kernel methods in PCA
Now let us suppose:
Φ : R^p → F, the feature space.
Possibly F is a very high-dimensional space.
In Kernel PCA, we do the PCA in feature space.
C = (1/l) Σ_{i=1..l} Φ(x_i) Φ(x_i)^T   (what is its meaning??)
remember about centering!
λv = Cv = (1/l) Σ_{i=1..l} ⟨Φ(x_i), v⟩ Φ(x_i)   (*)
Kernel Methods in PCA
Again, all solutions v with λ ≠ 0 lie in the space generated by {Φ(x_1), ..., Φ(x_l)}.
It has two useful consequences:
1) v ∈ span of {Φ(x_1), ..., Φ(x_l)}, i.e. v = Σ_{i=1..l} α_i Φ(x_i)
2) We may instead solve the set of equations
λ⟨Φ(x_i), v⟩ = ⟨Φ(x_i), Cv⟩, i = 1, ..., l
Defining an l×l kernel matrix K:
K_{ij} = k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
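A minimal sketch of building this kernel matrix (the Gaussian/RBF kernel and its gamma parameter are illustrative assumptions; any positive semi-definite kernel k may be substituted):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_matrix(X, kernel=rbf_kernel):
    l = X.shape[0]
    K = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            K[i, j] = kernel(X[i], X[j])   # K_ij = <Phi(x_i), Phi(x_j)>
    return K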
Kernel Methods in PCA
And using the result (1) in (2) we get
K² α = λ l K α   (3)
But we need not solve (3). It can be shown easily that the following simpler system gives us the solutions that are interesting to us:
K α = λ l α   (4)
Compute the eigenvalue problem for the kernel matrix.
The solutions (λ_k, α^k) further need to be normalized by imposing λ_k ⟨α^k, α^k⟩ = 1 (λ_k here being the corresponding eigenvalue of K), since v^k should satisfy ⟨v^k, v^k⟩ = 1.
If x is our new observation, its feature value will be Φ(x), and its kth principal score will be
⟨v^k, Φ(x)⟩ = Σ_{i=1..l} α_i^k ⟨Φ(x_i), Φ(x)⟩ = Σ_{i=1..l} α_i^k k(x_i, x)
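A hedged sketch of solving (4), normalizing the alpha^k, and scoring a new point (helper names are illustrative; it assumes K has already been centered as described on the following slides, and that a kernel function such as the RBF sketch above is available):

import numpy as np

def kpca_fit(K, n_components):
    """K: l x l (centered) kernel matrix; returns the normalized alpha^k as columns."""
    mu, A = np.linalg.eigh(K)                       # K alpha = mu alpha  (mu = l * lambda)
    order = np.argsort(mu)[::-1][:n_components]
    mu, A = mu[order], A[:, order]
    return A / np.sqrt(mu)                          # scale so that <v^k, v^k> = alpha^k' K alpha^k = 1

def kpca_score(alpha_k, X_train, x_new, kernel):
    # k-th principal score of x_new: sum_i alpha_i^k * k(x_i, x_new)
    return sum(a * kernel(x_i, x_new) for a, x_i in zip(alpha_k, X_train))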
Kernel Methods in PCA
Data centering:
Φ_S = (1/l) Σ_{i=1..l} Φ(x_i)
Φ̂(x) = Φ(x) − Φ_S
Hence, the kernel for the transformed space is
k̂(x, z) = ⟨Φ̂(x), Φ̂(z)⟩
        = k(x, z) − (1/l) Σ_{i=1..l} k(x, x_i) − (1/l) Σ_{i=1..l} k(x_i, z) + (1/l²) Σ_{i,j=1..l} k(x_i, x_j)
Kernel Methods in PCA
Expressed as an operation on the kernel matrix this can be rewritten as
K̂ = K − (1/l) j j' K − (1/l) K j j' + (1/l²) (j' K j) j j'
where j is the all-1s vector.
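A minimal sketch of this centering operation (the helper name is illustrative):

import numpy as np

def center_kernel(K):
    """K_hat = K - (1/l) j j' K - (1/l) K j j' + (1/l^2) (j' K j) j j'."""
    l = K.shape[0]
    J = np.ones((l, l)) / l        # (1/l) j j'
    return K - J @ K - K @ J + J @ K @ J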
Kernel Methods in PCA
[Figures: Linear PCA vs. Kernel PCA on the same data sets; Kernel PCA captures the nonlinear structure of the data]
Algorithm
Input: Data X = {x_1, x_2, ..., x_l} in n-dimensional space.
Process:
K_{ij} = k(x_i, x_j), i, j = 1, ..., l    (kernel matrix)
K̂ = K − (1/l) j j' K − (1/l) K j j' + (1/l²) (j' K j) j j'    (... for centered data)
[V, Λ] = eig(K̂)
α^j = (1/√λ_j) v_j, j = 1, ..., k
x̃ = ( Σ_{i=1..l} α_i^j k(x_i, x) )_{j=1..k}
Output: Transformed data x̃, the k-dimensional vector projection of new data into this subspace.
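An end-to-end illustrative sketch of the algorithm (not the original code; the RBF kernel, the gamma value, and the concentric-circles toy data are assumptions made for the demonstration):

import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    l = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # RBF kernel matrix K_ij
    J = np.ones((l, l)) / l
    K_hat = K - J @ K - K @ J + J @ K @ J                            # centered kernel matrix
    mu, A = np.linalg.eigh(K_hat)
    order = np.argsort(mu)[::-1][:n_components]
    mu, A = mu[order], A[:, order]
    A = A / np.sqrt(mu)                                              # normalize: <v^k, v^k> = 1
    return K_hat @ A                                                 # l x k projections of the training data

# Toy example: two concentric circles, a structure that linear PCA cannot unfold.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
Z = kernel_pca(X, n_components=2, gamma=2.0)   # transformed data, shape (200, 2)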
References
• I.T. Jolliffe (2002). Principal Component Analysis.
• B. Schölkopf, A. Smola, and K.-R. Müller (1998). Kernel Principal Component Analysis.
• B. Schölkopf and A.J. Smola (2002). Learning with Kernels.
• Christopher J.C. Burges (2005). Geometric Methods for Feature Extraction and Dimensional Reduction.