
Page 1:

PCA and Kernel PCA

Presented by Shicai Yang, Institute of Systems Engineering

June 14, 2015

Page 2:

Outline

• PCA

• Kernel Methods

• Kernel PCA

• Others

Page 3:

1. PCA Overview

• Principal component analysis (PCA) is a way to reduce data dimensionality

• PCA projects high-dimensional data onto a lower-dimensional subspace

• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores the small variability

Page 4:

PCA: An Intuitive Approach

Let us say we have x_i, i = 1…N data points in p dimensions (p is large).

If we want to represent the data set by a single point x_0, then the natural choice is the sample mean:

$$ \mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i $$

Can we justify this choice mathematically? Consider the squared-error criterion

$$ J_0(\mathbf{x}_0) = \sum_{i=1}^{N}\|\mathbf{x}_0 - \mathbf{x}_i\|^2 $$

It turns out that if you minimize J_0, you get the above solution, namely, the sample mean.
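As a quick check of this claim (added here, not on the original slide), setting the gradient of J_0 with respect to x_0 to zero recovers the sample mean:

$$ \nabla_{\mathbf{x}_0} J_0 = 2\sum_{i=1}^{N}(\mathbf{x}_0-\mathbf{x}_i) = \mathbf{0} \;\Longrightarrow\; N\mathbf{x}_0 = \sum_{i=1}^{N}\mathbf{x}_i \;\Longrightarrow\; \mathbf{x}_0 = \mathbf{m} $$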

Page 5:

PCA: An Intuitive Approach…

Representing the data set x_i, i = 1…N by its mean alone is quite uninformative.

So let's try to represent the data by a straight line of the form:

$$ \mathbf{x} = \mathbf{m} + a\,\mathbf{e} $$

This is the equation of a straight line that passes through m; e is a unit vector along the straight line, and the signed distance of a point x from m is a.

The training points projected onto this straight line would be

$$ \mathbf{x}_i' = \mathbf{m} + a_i\,\mathbf{e}, \qquad i = 1 \ldots N $$

Page 6:

PCA: An Intuitive Approach…

$$ J_1(a_1,\ldots,a_N,\mathbf{e}) = \sum_{i=1}^{N}\big\|(\mathbf{m}+a_i\mathbf{e})-\mathbf{x}_i\big\|^2 = \sum_{i=1}^{N}a_i^2\|\mathbf{e}\|^2 - 2\sum_{i=1}^{N}a_i\,\mathbf{e}^T(\mathbf{x}_i-\mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2 $$

Let's now determine the a_i's. Partially differentiating with respect to a_i we get:

$$ a_i = \mathbf{e}^T(\mathbf{x}_i-\mathbf{m}) $$

Plugging in this expression for a_i in J_1 we get:

$$ J_1(\mathbf{e}) = -\,\mathbf{e}^T S\,\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2 $$

where

$$ S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T $$

is called the scatter matrix.

Page 7:

PCA: An Intuitive Approach…

So minimizing J_1 is equivalent to maximizing:

$$ \mathbf{e}^T S\,\mathbf{e} $$

Subject to the constraint that e is a unit vector:

$$ \mathbf{e}^T\mathbf{e} = 1 $$

Use the Lagrange multiplier method to form the objective function:

$$ \mathbf{e}^T S\,\mathbf{e} - \lambda(\mathbf{e}^T\mathbf{e} - 1) $$

Differentiate to obtain the equation:

$$ 2S\mathbf{e} - 2\lambda\mathbf{e} = \mathbf{0} \quad\text{or}\quad S\mathbf{e} = \lambda\mathbf{e} $$

The solution is that e is the eigenvector of S corresponding to the largest eigenvalue.
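To make the result concrete, here is a minimal MATLAB sketch (my own illustration; the variable names x, m, S, e are not from the slides) that builds the scatter matrix of an N-by-p data matrix and takes the eigenvector of the largest eigenvalue as the first principal direction:

x = randn(100, 3);                      % example data: 100 observations in 3 dimensions
m = mean(x);                            % sample mean (1-by-p)
xc = x - repmat(m, size(x,1), 1);       % centered data
S = xc' * xc;                           % scatter matrix S = sum_i (x_i - m)(x_i - m)'
[V, D] = eig(S);                        % eigen-decomposition of S
[maxval, idx] = max(diag(D));           % locate the largest eigenvalue
e = V(:, idx);                          % first principal direction (unit vector)
a = xc * e;                             % signed distances a_i = e'*(x_i - m)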

Page 8:

PCA: An Intuitive Approach…

The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form:

$$ \mathbf{x} = \mathbf{m} + a_1\mathbf{e}_1 + \cdots + a_d\mathbf{e}_d $$

where d is much smaller than the original dimension p.

In this case one can form the objective function:

$$ J_d = \sum_{i=1}^{N}\Big\|\Big(\mathbf{m} + \sum_{k=1}^{d}a_{ki}\mathbf{e}_k\Big) - \mathbf{x}_i\Big\|^2 $$

It can also be shown that the vectors e_1, e_2, …, e_d are the d eigenvectors corresponding to the d largest eigenvalues of the scatter matrix

$$ S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T $$

Page 9:

PCA: Visually

Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.

Page 10:

PCA Steps

• Let x = (x1, x2, …, xn)^T be an n-dimensional random vector

⑴ Arrange the original observations into an observation matrix X, with each column an observation sample and each row a dimension

⑵ Compute the covariance matrix of the samples X: covX = COV(X)

⑶ Compute the eigenvalues and eigenvectors of covX, and sort the eigenvalues from largest to smallest

⑷ Take the eigenvectors corresponding to the m largest eigenvalues to form the matrix V

⑸ Y = V^T X; Y is then the dimension-reduced matrix (a minimal MATLAB sketch of these steps follows below)
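The following is one possible MATLAB implementation of the five steps above (a sketch, not code from the slides; the data, the variable names, and the choice m = 2 are assumptions):

% X: p-by-N observation matrix, each column is one observation sample
X = randn(5, 200);                              % example data: 200 samples in 5 dimensions
Xc = X - repmat(mean(X, 2), 1, size(X, 2));     % steps (1)-(2): center each dimension (row)
covX = cov(Xc');                                % covariance matrix (cov expects rows = samples)
[Evec, Eval] = eig(covX);                       % step (3): eigenvalues and eigenvectors
[vals, order] = sort(diag(Eval), 'descend');    % sort eigenvalues from largest to smallest
m = 2;                                          % step (4): number of components to keep (adjustable)
V = Evec(:, order(1:m));                        % eigenvectors of the m largest eigenvalues
Y = V' * Xc;                                    % step (5): m-by-N reduced data (the slide writes Y = V^T X; centering first is the usual convention)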

Page 11:

MATLAB Functions and Algorithms for PCA

1. princomp: principal component analysis
• PC = princomp(X)
• [PC, score, latent, tsquare] = princomp(X)

– Performs principal component analysis on the data matrix X (N*p; rows = number of observations, columns = number of feature variables) and returns the principal components (PC), the so-called Z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T² statistic for each data point (tsquare).

2. pcacov: principal component analysis using a covariance matrix
• PC = pcacov(X)
• [PC, latent, explained] = pcacov(X)

– Performs principal component analysis through the covariance matrix X and returns the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total variance of the observations accounted for by each eigenvector (explained). A short usage example follows below.
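For instance (a usage sketch with made-up random data, not from the slides):

X = randn(100, 4);                               % 100 observations of 4 feature variables
[PC, score, latent, tsquare] = princomp(X);      % PCA on the data matrix
[PC2, latent2, explained] = pcacov(cov(X));      % PCA directly on a covariance matrix
cumsum(explained)                                % cumulative percentage of variance explained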

Page 12:

3. pcares: residuals from principal component analysis
• residuals = pcares(X, ndim)

– Returns the residuals obtained by retaining ndim principal components of X. Note that ndim is a scalar and must be smaller than the number of columns of X. Also, X is a data matrix, not a covariance matrix.

4. barttest: Bartlett's test for the principal components
• ndim = barttest(X, alpha)
• [ndim, prob, chisquare] = barttest(X, alpha)

– Bartlett's test is a test for equality of variances. ndim = barttest(X, alpha) gives, at significance level alpha, the dimension of a non-random model that fits the data matrix X; ndim, the model dimension, is determined by a sequence of hypothesis tests. ndim = 1 means the variances of X along each principal component are equal; ndim = 2 means the variances along the second and all remaining components are equal. (A usage sketch follows below.)
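A brief usage sketch of these two functions (again with made-up random data):

X = randn(100, 4);                               % data matrix: 100 observations, 4 variables
residuals = pcares(X, 2);                        % residuals after retaining 2 principal components
[ndim, prob, chisquare] = barttest(X, 0.05);     % Bartlett's test at the 5% significance level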

Page 13:

Computing the Covariance

(1) XCOV = COV(X)

(2) % rows = observation samples, columns = feature variables; the returned cv is the covariance matrix
xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xx1 = x(:,i);                                  % i-th feature column
    mxx1 = xmean(i);                               % its mean
    for j = i:xsize(2)                             % covariance is symmetric, so start j at i
        xx2 = x(:,j);
        mxx2 = xmean(j);
        v = ((xx1-mxx1)'*(xx2-mxx2))/(xsize(1)-1); % divide by N-1, the number of samples minus one
        cv(i,j) = v;
        cv(j,i) = v;
    end
end
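As a quick sanity check (my own suggestion, not from the slides), the loop above should agree with MATLAB's built-in cov; a vectorized equivalent makes this easy to verify:

x = randn(50, 3);                          % 50 samples, 3 variables
xc = x - repmat(mean(x), size(x,1), 1);    % centered data
cv = (xc' * xc) / (size(x,1) - 1);         % vectorized equivalent of the loop above
max(max(abs(cv - cov(x))))                 % should be on the order of machine precision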

Page 14:

MATLAB Implementation of PCA

function [xeigvsort,xeigdsort,final]=KL_Exp(x)

xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xadjust(:,i) = x(:,i) - xmean(i);          % subtract the mean of each column
end
xcov = cov(xadjust);                           % compute the covariance matrix
[xeigv, xeigd] = eig(xcov);                    % compute eigenvalues and eigenvectors
xeigvsort = fliplr(xeigv);                     % sort the eigenvectors v (eig returns ascending order for a symmetric matrix)
xeigdsort = flipud(fliplr(xeigd));             % sort the eigenvalues d in descending order
finaleigs = xeigvsort(:, 1:xsize(2));          % choose the transform basis; xsize(2) is adjustable
pdata = finaleigs' * xadjust';                 % apply the transform
final = pdata';
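A possible usage sketch of this function (the data and variable choices here are mine):

x = randn(100, 5);                               % 100 observations in 5 dimensions
[evecs, evals, projected] = KL_Exp(x);           % sorted eigenvectors, eigenvalues, transformed data
plot(projected(:,1), projected(:,2), '.');       % view the data along the first two principal directions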

Page 15:

Assumptions and Limitations

• Linearity assumption
– The internal model of PCA is linear, which also means that the relationships its principal component analysis can capture are linear. The now-popular family of Kernel-PCA methods is a nonlinear extension of the original PCA method.

• Mean and variance as sufficient statistics
– Models whose probability distribution can be fully described by the mean and variance are limited to exponential-family distributions. If the probability distribution of the data is non-Gaussian, PCA breaks down and ICA methods come into play.

Page 16:

• Directions of large variance carry greater importance
– PCA implicitly assumes that the data has a high signal-to-noise ratio, so the one-dimensional direction with the highest variance can be regarded as a principal component, while variations with smaller variance are regarded as noise. This is determined by the choice of a low-pass filter.

• Orthogonal principal components
– PCA assumes that the principal component vectors are mutually orthogonal, so that a range of efficient tools from linear algebra can be used to solve the problem, greatly improving efficiency and widening the scope of application.

Page 17:

2. Kernel Methods

• Find a mapping Φ such that, in the new space, problem solving is easier (e.g. linear)

• The kernel represents the similarity between two objects, defined as the dot product in this new vector space

• But the mapping is left implicit

• Easy generalization of a lot of dot-product (or distance) based pattern recognition algorithms

Page 18:

Kernel Methods: the mapping

[Figure: the nonlinear mapping Φ from the Original Space to the Feature (Vector) Space]

Page 19:

Feature Spaces

$$ \Phi:\ \mathbf{x} \mapsto \Phi(\mathbf{x}), \qquad \mathbb{R}^d \to F $$

Non-linear mapping Φ to F:
1. High-dimensional space
2. Infinite-dimensional countable space: L2
3. Function space (Hilbert space)

Example: $ (x, y) \mapsto (x^2,\ y^2,\ \sqrt{2}\,xy) $
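To see what this example buys us (my own check, not from the slides): the dot product of the mapped vectors equals the squared dot product of the originals, Φ(u)·Φ(v) = (u·v)². A tiny MATLAB check:

phi = @(u) [u(1)^2, u(2)^2, sqrt(2)*u(1)*u(2)];   % explicit feature map for 2-D inputs
u = [1 2]; v = [3 -1];                            % two arbitrary points
dot(phi(u), phi(v))                               % dot product in the feature space
(u * v')^2                                        % polynomial kernel (u.v)^2 -- same value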

Page 20:

Kernel: more formal definition

• A kernel k(x,y)
– is a similarity measure
– defined by an implicit mapping Φ
– from the original space to a vector space (feature space)
– such that: k(x,y) = Φ(x)•Φ(y)

• This similarity measure and the mapping include:
– Invariance or other a priori knowledge
– Simpler structure (linear representation of the data)
– The class of functions the solution is taken from
– Possibly infinite dimension (hypothesis space for learning)
– … but still computational efficiency when computing k(x,y)

General Principles governing Kernel Design

Page 21:

Kernel Trick

Note: In the dual representation we used the Gram matrix to express the solution.

Kernel Trick: replace

$$ \mathbf{x} \;\to\; \Phi(\mathbf{x}), \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j\rangle \;\to\; G_{ij} = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j) \quad\text{(the kernel)} $$

If we use algorithms that only depend on the Gram matrix, G, then we never have to know (compute) the actual features

This is the crucial point of kernel methods

Page 22:

Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.

Some Kernels:

$$ k(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2/c} $$

$$ k(\mathbf{x},\mathbf{y}) = \langle\mathbf{x},\mathbf{y}\rangle^{d} $$

$$ k(\mathbf{x},\mathbf{y}) = \tanh(\langle\mathbf{x},\mathbf{y}\rangle + \theta) $$

$$ k(\mathbf{x},\mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2 + c^2}} $$

Some Kernel Algorithms:

– SVM
– Fisher LDA (KFDA)
– Kernel Regression
– Kernel PCA
– Kernel CCA
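As an illustration of the first module on its own (a sketch; the kernel choice, the width parameter c, and the data are assumptions of mine), the Gram matrix K for the RBF kernel on a small data set can be computed as:

X = randn(20, 2);                           % 20 points in 2-D (rows = points)
c = 1;                                      % RBF width parameter
N = size(X, 1);
K = zeros(N, N);
for i = 1:N
    for j = 1:N
        d2 = sum((X(i,:) - X(j,:)).^2);     % squared Euclidean distance
        K(i,j) = exp(-d2 / c);              % kernel entry k(x_i, x_j)
    end
end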

Page 23:

Benefits from kernels

• Generalizes (nonlinearly) pattern recognition algorithms in clustering, classification, density estimation, …

– When these algorithms are dot-product based, by replacing the dot product (x•y) by k(x,y) = Φ(x)•Φ(y)
  e.g.: linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, …
  NB: this often implies working with the "dual" form of the algorithm.

– When these algorithms are distance-based, by replacing the squared distance d²(x,y) by k(x,x) + k(y,y) - 2k(x,y)

• Freedom of choosing Φ implies a large variety of learning algorithms
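The distance substitution is simply the expansion of the squared feature-space distance (a one-line derivation added here for completeness):

$$ \|\Phi(\mathbf{x})-\Phi(\mathbf{y})\|^2 = \langle\Phi(\mathbf{x}),\Phi(\mathbf{x})\rangle + \langle\Phi(\mathbf{y}),\Phi(\mathbf{y})\rangle - 2\langle\Phi(\mathbf{x}),\Phi(\mathbf{y})\rangle = k(\mathbf{x},\mathbf{x}) + k(\mathbf{y},\mathbf{y}) - 2k(\mathbf{x},\mathbf{y}) $$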

Page 24:

3. Kernel PCA

• The assumption behind PCA is that the data points x are multivariate Gaussian

• Often this assumption does not hold

• However, it may still be possible that a transformation Φ(x) is Gaussian; then we can perform PCA in the space of Φ(x)

• Kernel PCA performs this PCA; however, because of the "kernel trick," it never computes the mapping Φ(x) explicitly!

Page 25:

KPCA: Basic Idea

Page 26:

Kernel PCA Formulation

• We need the following fact:

• Let v be an eigenvector of the scatter matrix

$$ S = \sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T $$

• Then v belongs to the linear space spanned by the data points x_i, i = 1, 2, …, N.

• Proof: from Sv = λv,

$$ \mathbf{v} = \frac{1}{\lambda}S\mathbf{v} = \frac{1}{\lambda}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\mathbf{v} = \frac{1}{\lambda}\sum_{i=1}^{N}(\mathbf{x}_i^T\mathbf{v})\,\mathbf{x}_i $$

Page 27:

Kernel PCA Formulation…

• Let C be the scatter matrix of the centered mapping Φ(x):

$$ C = \sum_{i=1}^{N}\Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^T $$

• Let w be an eigenvector of C; then w can be written as a linear combination:

$$ \mathbf{w} = \sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_k) $$

• Also, we have:

$$ C\mathbf{w} = \lambda\mathbf{w} $$

• Combining, we get:

$$ \Big(\sum_{i=1}^{N}\Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_k) $$

Page 28:

Kernel PCA Formulation…

Left-multiplying the combined equation by Φ(x_l)^T gives, for l = 1, 2, …, N:

$$ \Phi(\mathbf{x}_l)^T\Big(\sum_{i=1}^{N}\Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\Phi(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_l)^T\Phi(\mathbf{x}_k) $$

$$ \sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_k\big(\Phi(\mathbf{x}_l)^T\Phi(\mathbf{x}_i)\big)\big(\Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_k)\big) = \lambda\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_l)^T\Phi(\mathbf{x}_k), \qquad l = 1,2,\ldots,N $$

In matrix form, with the kernel (or Gram) matrix K defined by K_{ij} = Φ(x_i)^T Φ(x_j), this reads:

$$ K^2\boldsymbol{\alpha} = \lambda K\boldsymbol{\alpha} \quad\Longrightarrow\quad K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha} $$

Page 29:

Kernel PCA Formulation…

From the eigen equation

$$ K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha} $$

and the fact that the eigenvector w is normalized to 1, we obtain:

$$ \|\mathbf{w}\|^2 = \Big(\sum_{i=1}^{N}\alpha_i\Phi(\mathbf{x}_i)\Big)^T\Big(\sum_{i=1}^{N}\alpha_i\Phi(\mathbf{x}_i)\Big) = \boldsymbol{\alpha}^T K\boldsymbol{\alpha} = \lambda\,\boldsymbol{\alpha}^T\boldsymbol{\alpha} = 1 $$

Page 30:

KPCA Algorithm

Step 1: Compute the Gram matrix:  K_{ij} = k(x_i, x_j),  i, j = 1, …, N

Step 2: Compute the (eigenvalue, eigenvector) pairs of K:  (λ_l, α^l),  l = 1, …, M

Step 3: Normalize the eigenvectors:

$$ \boldsymbol{\alpha}^l \;\leftarrow\; \frac{\boldsymbol{\alpha}^l}{\sqrt{\lambda_l}} $$

Thus, an eigenvector w^l of C is now represented as:

$$ \mathbf{w}^l = \sum_{k=1}^{N}\alpha_k^l\,\Phi(\mathbf{x}_k) $$

To project a test feature Φ(x) onto w^l we need to compute:

$$ \Phi(\mathbf{x})^T\mathbf{w}^l = \sum_{k=1}^{N}\alpha_k^l\,\Phi(\mathbf{x})^T\Phi(\mathbf{x}_k) = \sum_{k=1}^{N}\alpha_k^l\,k(\mathbf{x},\mathbf{x}_k) $$

So, we never need Φ explicitly.
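Putting the three steps together, here is a minimal MATLAB sketch of KPCA with an RBF kernel (my own illustration; the kernel, its width c, the data, and all variable names are assumptions, and centering of K in feature space is omitted here, as it is on the slides):

X = randn(50, 2);                                % 50 training points in 2-D (rows = points)
c = 1;                                           % RBF kernel width
N = size(X, 1);
K = zeros(N, N);
for i = 1:N                                      % Step 1: Gram matrix K_ij = k(x_i, x_j)
    for j = 1:N
        K(i,j) = exp(-sum((X(i,:)-X(j,:)).^2)/c);
    end
end
[A, L] = eig(K);                                 % Step 2: eigenvectors (columns of A) and eigenvalues of K
lambda = diag(L);
[lambda, order] = sort(lambda, 'descend');
A = A(:, order);
M = 2;                                           % number of kernel principal components to keep
for l = 1:M                                      % Step 3: normalize so that lambda_l * ||alpha^l||^2 = 1
    A(:,l) = A(:,l) / sqrt(lambda(l));
end
x = [0.5 -0.3];                                  % a test point
kx = exp(-sum((X - repmat(x, N, 1)).^2, 2)/c);   % vector of k(x, x_k), k = 1..N
y = A(:,1:M)' * kx;                              % projections onto the first M kernel PCs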

Page 31:

Examples of Kernels

[Figures: example results with a polynomial kernel (n = 2) and an RBF kernel (n = 2)]

Page 32:

4. Others

• 2DPCA
– Prof. Jing-yu Yang et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
– The feature extraction performance of 2DPCA is at least as good as that of PCA, but it requires more memory than PCA.

• 2DLDA
– Prof. Baozong Yuan et al., Beijing Jiaotong University, P. R. Letters, 2005(3)

• Kernel ECA (KECA)
– Robert Jenssen et al., IEEE T-PAMI, 2010(5)
– Preserves maximum entropy (minimizes the loss of entropy); it neatly combines entropy with the kernel-based data mapping, turning the entropy computation naturally into a computation on the kernel matrix, so the problem becomes an optimization in the kernel space.

Page 33:

References

[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.

[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of the 1998 conference on Advances in Neural Information Processing Systems II, 1999.

[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.

[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information Theory and Statistical Learning, Springer US, 2009, pp. 209-230.

[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm," in Proceedings of the 2006 conference on Advances in Neural Information Processing Systems 19, 2007, pp. 633-640.

[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image Analysis, Oslo, Norway, 2009, pp. 626-635.

[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.