
PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding

Principal Component Analysis and Matrix Factorizations for Learning

Chris Ding
Lawrence Berkeley National Laboratory

Supported by Office of Science, U.S. Dept. of Energy


Many unsupervised learning methods are closely related in a simple way

[Diagram relating: PCA, K-means clustering, Spectral Clustering, NMF, Indicator Matrix Quadratic Clustering, Semi-supervised classification, Semi-supervised clustering, Outlier detection]


Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

Widely used in a large number of different fields

Most widely known as PCA (multivariate statistics)

SVD is the theoretical basis for PCA


Brief history

PCA
  Draw a plane closest to data points (Pearson, 1901)
  Retain most variance (Hotelling, 1933)

SVD
  Low-rank approximation (Eckart-Young, 1936)
  Practical application / efficient computation (Golub-Kahan, 1965)

Many generalizations


PCA and SVD

Data: n points in p dimensions, $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = X X^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions: $u_k$ (principal axis, subspace)

Principal components: $v_k$ (projection on the subspace)

Underlying basis: SVD, $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
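The relation between the SVD of X and the two eigen-decompositions can be checked numerically. The following is a minimal sketch, assuming NumPy, random toy data, and centered variables; none of these choices come from the slides. It verifies that the left singular vectors are eigenvectors of $X X^T$, the right singular vectors are eigenvectors of $X^T X$, and both share the eigenvalues $\lambda_k = \sigma_k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 40                          # p dimensions, n points (columns of X); toy sizes
X = rng.standard_normal((p, n))
X -= X.mean(axis=1, keepdims=True)    # assumption: center each variable

# Thin SVD: X = U diag(s) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

C = X @ X.T                           # (unnormalized) covariance, p x p
K = X.T @ X                           # Gram (kernel) matrix, n x n

# Columns of U are eigenvectors of C with eigenvalues s**2
cov_ok = np.allclose(C @ U, U * s**2)
# Columns of V are eigenvectors of K with the same eigenvalues
gram_ok = np.allclose(K @ Vt.T, Vt.T * s**2)
# Summing the rank-one terms sigma_k u_k v_k^T rebuilds X
svd_ok = np.allclose((U * s) @ Vt, X)
```

In this convention the covariance is not divided by n, matching $C = X X^T$ as written above.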


Further Developments

SVD/PCA
  Principal Curves
  Independent Component Analysis
  Sparse SVD/PCA (many approaches)
  Mixture of Probabilistic PCA
  Generalization to exponential family, max-margin
  Connection to K-means clustering
  Kernel (inner-product): Kernel PCA


Methods of PCA Utilization

Data: $X = (x_1, x_2, \ldots, x_n)$

Principal components (uncorrelated random variables): $u_k = u_k(1) X_1 + \cdots + u_k(d) X_d$

Dimension reduction: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Projection to low-dimensional subspace: $\tilde{X} = U^T X$, $U = (u_1, \ldots, u_k)$

Sphering the data (transform data to N(0,1)): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
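The projection $\tilde{X} = U^T X$ and the sphering transform $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$ can be illustrated as follows. This is a sketch under stated assumptions: NumPy, synthetic correlated data, centered variables, and $C = X X^T$ without a 1/n factor, so the sphered data satisfies $\tilde{X} \tilde{X}^T = I$ rather than having unit per-sample variance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 4, 200, 2                           # toy sizes (assumed)
A = rng.standard_normal((p, p))
X = A @ rng.standard_normal((p, n))           # correlated data, columns are points
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Projection to the k-dimensional principal subspace: X_proj = U_k^T X
X_proj = U[:, :k].T @ X                       # k x n
proj_cov = X_proj @ X_proj.T                  # diagonal: projected variables are uncorrelated

# Sphering: X_sph = C^{-1/2} X = U Sigma^{-1} U^T X, with C = X X^T
X_sph = U @ np.diag(1.0 / s) @ U.T @ X
sph_cov = X_sph @ X_sph.T                     # identity matrix
```

Dividing by `s` assumes X has full row rank; for rank-deficient data only the nonzero singular values would be inverted.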


Applications of PCA/SVD

Most popular in multivariate statistics

Image processing, signal processing

Physics: principal axis, diagonalization of 2nd tensor (mass)

Climate: Empirical Orthogonal Functions (EOF)

Kalman filter: $s^{(t+1)} = A s^{(t)} + E$, $P^{(t+1)} = A P^{(t)} A^T$

Reduced order analysis


Applications of PCA/SVD

PCA/SVD is as widely used as the Fast Fourier Transform
  Both are spectral expansions
  FFT is used more on partial differential equations
  PCA/SVD is used more on discrete (data) analysis
  PCA/SVD surpasses FFT as computational sciences advance further

PCA/SVD
  Select combination of variables
  Dimension reduction
  An image has $10^4$ pixels. True dimension is 20!


PCA is a Matrix Factorization (spectral/eigen decomposition)

Covariance: $C = X X^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$

Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$

Underlying basis: SVD, $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Principal directions: $U = (u_1, u_2, \ldots, u_k)$

Principal components: $V = (v_1, v_2, \ldots, v_k)$


From PCA to spectral clustering using generalized eigenvectors

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

In Kernel PCA we compute the eigenvector: $W v = \lambda v$

Generalized eigenvector: $W q = \lambda D q$, with $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$

This leads to Spectral Clustering!


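Scaled PCA ⇒ Spectral Clustering

PCA: $W = \sum_k \lambda_k v_k v_k^T$

Scaled PCA: $\tilde{W} = D^{-1/2} W D^{-1/2}$, $\tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2}$

$W = D^{1/2} \tilde{W} D^{1/2} = D \left( \sum_k \lambda_k q_k q_k^T \right) D$

Scaled principal component: $q_k = D^{-1/2} v_k$, where $\lambda_k$, $v_k$ are here the eigenvalues and eigenvectors of $\tilde{W}$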
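The link between scaled PCA and the generalized eigenproblem $W q = \lambda D q$ can be checked on a small example. This is a sketch only, assuming NumPy and a random symmetric nonnegative similarity matrix (not data from the tutorial). It takes the eigenvectors $v_k$ of $\tilde{W} = D^{-1/2} W D^{-1/2}$, forms $q_k = D^{-1/2} v_k$, and confirms both the generalized eigen-relation and the expansion $W = D (\sum_k \lambda_k q_k q_k^T) D$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
B = rng.random((n, n))
W = (B + B.T) / 2                      # symmetric nonnegative similarity matrix (toy)
d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
D = np.diag(d)

# Scaled PCA: eigen-decomposition of W_tilde = D^{-1/2} W D^{-1/2}
W_tilde = W / np.sqrt(np.outer(d, d))
lam, V = np.linalg.eigh(W_tilde)       # eigenvalues in ascending order

# Scaled principal components q_k = D^{-1/2} v_k (one per column)
Q = V / np.sqrt(d)[:, None]

# Generalized eigenvectors: W q_k = lam_k D q_k
gen_eig_residual = np.abs(W @ Q - (D @ Q) * lam).max()

# Expansion: W = D (sum_k lam_k q_k q_k^T) D
W_rebuilt = D @ (Q * lam) @ Q.T @ D
```

The largest eigenvalue `lam[-1]` equals 1 and its $q$ is a constant vector, which is the trivial component that spectral clustering discards.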


Scaled PCA on a Rectangle Matrix ⇒ Correspondence Analysis

Re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, $\tilde{p}_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}$

Apply SVD on $\tilde{P}$

Subtract trivial component: $P - r c^T / p_{..} = D_r \left( \sum_{k=1} \lambda_k f_k g_k^T \right) D_c$

$r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$

$f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are scaled row and column principal components (standard coordinates in CA)

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)
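A small numerical check of the re-scaling and the trivial component may help here. The sketch below assumes NumPy and a randomly generated, strictly positive contingency table (a made-up example). It shows that the leading singular value of $\tilde{P}$ is 1, that the leading term reproduces $r c^T / p_{..}$, and that the remaining terms rebuild $P - r c^T / p_{..}$ from the scaled components $f_k$ and $g_k$.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.integers(1, 20, size=(4, 5)).astype(float)   # toy contingency table, all entries > 0
r = P.sum(axis=1)                 # row sums p_i.
c = P.sum(axis=0)                 # column sums p_.j
total = P.sum()                   # p_..

# Re-scaling: P_tilde = D_r^{-1/2} P D_c^{-1/2}
P_tilde = P / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(P_tilde, full_matrices=False)

# Leading (trivial) component mapped back to the original scale
trivial = s[0] * np.outer(np.sqrt(r) * U[:, 0], np.sqrt(c) * Vt[0])

# Scaled row / column components for k >= 1
F = U[:, 1:] / np.sqrt(r)[:, None]       # f_k = D_r^{-1/2} u_k
G = Vt[1:].T / np.sqrt(c)[:, None]       # g_k = D_c^{-1/2} v_k
rebuilt = np.diag(r) @ (F * s[1:]) @ G.T @ np.diag(c)
```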


Nonnegative Matrix Factorization

Data matrix: n points in p dimensions, $X = (x_1, x_2, \ldots, x_n)$

Each $x_i$ is an image, document, webpage, etc.

Decomposition (low-rank approximation): $X \approx F G^T$

Nonnegative matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$

$F = (f_1, f_2, \ldots, f_k)$, $G = (g_1, g_2, \ldots, g_k)$


Solving NMF with multiplicative updating

$J = \|X - F G^T\|^2$, $F \ge 0$, $G \ge 0$

Fix F, solve for G; fix G, solve for F

Lee & Seung (2000) propose

$G_{jk} \leftarrow G_{jk} \dfrac{(X^T F)_{jk}}{(G F^T F)_{jk}}$

$F_{ik} \leftarrow F_{ik} \dfrac{(X G)_{ik}}{(F G^T G)_{ik}}$
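The alternating multiplicative updates can be written in a few lines. The following is a minimal sketch, assuming NumPy, random positive initialization, a fixed iteration count, and a small constant `eps` in the denominators to avoid division by zero; the function name `nmf_multiplicative` and these settings are illustrative choices, not part of the tutorial.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-12, seed=0):
    """Approximate nonnegative X (p x n) by F @ G.T with F (p x k), G (n x k) >= 0."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + 0.1          # positive random start (assumption)
    G = rng.random((n, k)) + 0.1
    errors = []
    for _ in range(n_iter):
        # Fix F, update G:  G_jk <- G_jk (X^T F)_jk / (G F^T F)_jk
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
        # Fix G, update F:  F_ik <- F_ik (X G)_ik / (F G^T G)_ik
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        errors.append(np.linalg.norm(X - F @ G.T) ** 2)   # objective J
    return F, G, errors

rng = np.random.default_rng(4)
X = rng.random((8, 3)) @ rng.random((3, 12))   # toy nonnegative matrix of rank 3
F, G, errors = nmf_multiplicative(X, k=3)
```

Because each update multiplies by a nonnegative ratio, F and G stay nonnegative without any explicit projection step.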


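Matrix Factorization Summary

Symmetric (kernel matrix, graph)
  PCA: $W = V \Lambda V^T$
  Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D$
  NMF: $W \approx Q Q^T$

Rectangle matrix (contingency table, bipartite graph)
  PCA: $X = U \Sigma V^T$
  Scaled PCA: $X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Lambda G^T D_c$
  NMF: $X \approx F G^T$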


Indicator Matrix Quadratic Clustering

Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T H = I$, $H \ge 0$

K-means: $W = X^T X$; Kernel K-means: $W = (\langle \phi(x_i), \phi(x_j) \rangle)$

Spectral clustering (normalized cut)


Indicator Matrix Quadratic Clustering

Additional features:

Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$

Semi-supervised clustering, with (A) must-link and (B) cannot-link constraints: $\max_H \mathrm{Tr}(H^T W H + \alpha H^T A H - \beta H^T B H)$

Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in H

Nonnegative Lagrangian Relaxation: $H_{ik} \leftarrow H_{ik} \sqrt{\dfrac{(W H + C/2)_{ik}}{(H \alpha)_{ik}}}$, $\alpha = H^T W H + H^T C$


Tutorial Outline

PCA
  Recent developments on PCA/SVD
  Equivalence to K-means clustering

Scaled PCA
  Laplacian matrix
  Spectral clustering
  Spectral ordering

Nonnegative Matrix Factorization
  Equivalence to K-means clustering
  Holistic vs. parts-based

Indicator Matrix Quadratic Clustering
  Use Nonnegative Lagrangian Relaxation
  Includes K-means and spectral clustering
  Semi-supervised classification
  Semi-supervised clustering
  Outlier detection


Part 1.B. Recent Developments on PCA and SVD

Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD
  Semi-discrete, truncation, L1 constraint, direct sparsification
Column Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering


PCA and SVD

Data matrix: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = X X^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions: $u_k$ (principal axis, subspace)

Principal components: $v_k$ (projection on the subspace)

Underlying basis: SVD, $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$


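Kernel PCA

Mapping: $x_i \to \phi(x_i)$

Kernel: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

PCA component $v$; feature extraction: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$

Indefinite kernels

Generalization to graphs with nonnegative weights

(Schölkopf, Smola, Müller, 1996)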


Mixture of PCA

Data has local structures. Global PCA on all data is not useful.

Clustering PCA (Hinton et al.)
  Use clustering to partition the data into clusters
  Perform PCA in each cluster
  No explicit generative model

Probabilistic PCA (Tipping & Bishop)
  Latent variables
  Generative model (Gaussian)
  Mixture of Gaussians ⇒ mixture of PCA
  Adding Markov dynamics for latent variables (Linear Gaussian Models)


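Probabilistic PCA

Linear Gaussian Model: $x_i = W s_i + \mu + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$

Latent variables: $S = (s_1, \ldots, s_n)$

Gaussian prior: $P(s) \sim N(s_0, \sigma_s^2 I)$

$x \sim N(W s_0 + \mu, \; \sigma^2 I + \sigma_s^2 W W^T)$

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)

Linear Gaussian Model with dynamics: $s_{i+1} = A s_i + \eta$, $x_i = W s_i + \epsilon$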


Sparse PCA

Compute a factorization $X \approx U V^T$
  U or V is sparse, or both are sparse

Why sparse?
  Variable selection (sparse U)
  When n >> d
  Storage saving
  Other new reasons?

L1 and L2 constraints


Sparse PCA: Truncation and Discretization

Sparsified SVD: $X \approx U \Sigma V^T$, $U = (u_1 \cdots u_k)$, $V = (v_1 \cdots v_k)$
  Compute $\{u_k, v_k\}$ one at a time, truncate those entries below a threshold.
  Recursively compute all pairs using deflation: $X \leftarrow X - \sigma u v^T$
  (Zhang, Zha, Simon, 2002)

Semi-discrete decomposition: $X \approx U \Sigma V^T$
  U, V only contain {-1, 0, 1}
  Iterative algorithm to compute U, V using deflation (Kolda & O'Leary, 1999)
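The truncate-then-deflate idea can be sketched as below. This is an illustration of the general scheme only, not the algorithm of Zhang, Zha and Simon: the relative threshold rule, the renormalization, the choice $\sigma = u^T R v$, and the function names `truncate` and `sparsified_svd` are all assumptions made for the example, written with NumPy.

```python
import numpy as np

def truncate(vec, rel_threshold):
    """Zero entries below rel_threshold * max|vec|, then renormalize to unit length."""
    out = np.where(np.abs(vec) >= rel_threshold * np.abs(vec).max(), vec, 0.0)
    return out / np.linalg.norm(out)      # the largest entry is always kept

def sparsified_svd(X, k, rel_threshold=0.3):
    R = np.array(X, dtype=float)          # residual matrix, deflated step by step
    U, S, V = [], [], []
    res_norms = [np.linalg.norm(R)]
    for _ in range(k):
        u_full, _, vt_full = np.linalg.svd(R, full_matrices=False)
        u = truncate(u_full[:, 0], rel_threshold)   # sparse left vector
        v = truncate(vt_full[0], rel_threshold)     # sparse right vector
        sigma = u @ R @ v                 # best coefficient for the fixed pair (u, v)
        R = R - sigma * np.outer(u, v)    # deflation: X <- X - sigma u v^T
        U.append(u); S.append(sigma); V.append(v)
        res_norms.append(np.linalg.norm(R))
    return np.array(U).T, np.array(S), np.array(V).T, res_norms

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 8))
U, S, V, res_norms = sparsified_svd(X, k=3)
```

With unit-norm $u$, $v$ and $\sigma = u^T R v$, each deflation step reduces the squared residual norm by exactly $\sigma^2$, so the residual never grows even though the truncated vectors are no longer exact singular vectors.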


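Sparse PCA: L1 constraint

LASSO (Tibshirani, 1996): $\min \|y - X^T \beta\|^2$, $\|\beta\|_1 \le t$

SCoTLASS (Jolliffe & Uddin, 2003): $\max_u u^T (X X^T) u$, $\|u\|_1 \le t$, $u^T u_h = 0$

Least Angle Regression (Efron et al., 2004)

Sparse PCA (Zou, Hastie, Tibshirani, 2004):

$\min_{\alpha, \beta} \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, $\alpha^T \alpha = I$

$v_j = \beta_j / \|\beta_j\|$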


Sparse PCA: Direct Sparsification

Sparse SVD with explicit sparsification (Zhang, Zha, Simon, 2003)
  Rank-one approximation: $\min_{u,v} \|X - d\, u v^T\|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
  Minimize a bound
  Deflation

Direct sparse PCA, on covariance matrix S (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004)
  $\max_u u^T S u = \max_u \mathrm{Tr}(S u u^T) = \max_U \mathrm{Tr}(S U)$
  s.t. $\mathrm{Tr}(U) = 1$, $\mathrm{nnz}(U) \le k^2$, $U \succeq 0$, $\mathrm{rank}(U) = 1$


Sparse PCA Summary

Many different approaches
  Truncation, discretization
  L1 constraint
  Direct sparsification
  Other approaches

Sparse matrix factorization in general
  L1 constraint

Many questions
  Orthogonality
  Unique solution, global solution


PCA: Further Generalizations

Generalization to Exponential Family (Collins, Dasgupta, Schapire, 2001)

Maximum Margin Factorization (Srebro, Rennie, Jaakkola, 2004)
  Collaborative filtering
  Input Y is binary
  Hard margin: $Y_{ia} X_{ia} \ge 1$, $\forall ia \in S$
  Soft margin: $\min_X \|X\|_{\Sigma} + c \sum_{ia \in S} \max(0, 1 - Y_{ia} X_{ia})$
  $X = U V^T$, $\|X\|_{\Sigma} = \frac{1}{2} \left( \|U\|_{Fro}^2 + \|V\|_{Fro}^2 \right)$


Column Partitioned Matrix Factorizations

Column partitioned data matrix:

$X = (x_1, \ldots, x_n) = (\overbrace{x_1 \cdots x_{n_1}}^{n_1}, \overbrace{x_{n_1+1} \cdots x_{n_1+n_2}}^{n_2}, \cdots, \overbrace{\cdots x_n}^{n_k})$, $n_1 + \cdots + n_k = n$

Partitions are generated by clustering

Centroid matrix $U = (u_1 \cdots u_k)$
  $u_k$ is a centroid
  Fix U, compute V: $\min \|X - U V^T\|_F^2$, $V = X^T U (U^T U)^{-1}$

Represent each partition by an SVD
  Pick leading U's to form U:
  $U = (U_1, \ldots, U_k) = (\overbrace{u_1^{(1)} \cdots u_{l_1}^{(1)}}^{l_1}, \cdots, \overbrace{u_1^{(k)} \cdots u_{l_k}^{(k)}}^{l_k})$
  Fix U, compute V

Several other variations

(Zhang & Zha, 2001)
(Castelli, Thomasian & Li, 2003)
(Park, Jeon & Rosen, 2003)
(Dhillon & Modha, 2001)
(Zeimpekis & Gallopoulos, 2004)
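The "fix U, compute V" step for the centroid matrix can be sketched as follows. The example assumes NumPy, a column partition that is given in advance (the slides obtain it by clustering), and synthetic data scattered around random centers. It forms the centroid matrix U and checks that $V = X^T U (U^T U)^{-1}$ agrees with a generic least-squares solve of $\min \|X - U V^T\|_F^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
p, sizes = 4, [5, 7, 6]                          # k = 3 column partitions (toy sizes)
labels = np.repeat(np.arange(len(sizes)), sizes) # partition label of each column
centers = 3.0 * rng.standard_normal((p, len(sizes)))
X = centers[:, labels] + 0.1 * rng.standard_normal((p, labels.size))

# Centroid matrix U = (u_1 ... u_k): one centroid per partition
U = np.stack([X[:, labels == j].mean(axis=1) for j in range(len(sizes))], axis=1)

# Fix U, compute V in closed form: V = X^T U (U^T U)^{-1}
V = X.T @ U @ np.linalg.inv(U.T @ U)

# Reference: least-squares solution of U V^T = X, column by column
Vt_lstsq = np.linalg.lstsq(U, X, rcond=None)[0]
residual = np.linalg.norm(X - U @ V.T)
```

The explicit inverse is used only to mirror the formula; in practice a least-squares solver is the more stable way to obtain V.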


Two-dimensional SVD

A large number of data objects are 2-D: images, maps

Standard method:
  Convert (re-order) each image as a 1D vector
  Collect all 1D vectors into a single (big) matrix
  Apply SVD on the big matrix

2D-SVD is developed for 2D objects
  Extension of standard SVD
  Keeping the 2D characteristics
  Improves quality of low-dimensional approximation
  Reduces computation, storage


Linearize a 2D object into 1D object

[Figure: a 2D pixel array re-ordered into a single pixel vector]


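SVD and 2D-SVD

SVD: $X = (x_1, x_2, \ldots, x_n)$
  Eigenvectors of $X X^T$ and $X^T X$
  $X = U \Sigma V^T$, $\Sigma = U^T X V$

2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$
  Eigenvectors of
  row-row covariance: $F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$
  column-column covariance: $G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$
  $A_i = U M_i V^T$, $M_i = U^T A_i V$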


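2D-SVD

$\{A\} = \{A_1, A_2, \ldots, A_n\}$, assume $\bar{A} = 0$

row-row cov: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$

col-col cov: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$

Bilinear subspace: $U = (u_1, u_2, \ldots, u_k)$, $V = (v_1, v_2, \ldots, v_k)$

$M_i = U^T A_i V$

$A_i = U M_i V^T$, $i = 1, \ldots, n$

$A_i \in \mathbb{R}^{r \times c}$, $U \in \mathbb{R}^{r \times k}$, $V \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$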


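2D-SVD Error Analysis

$A_i \approx L M_i R^T$, $A_i \in \mathbb{R}^{r \times c}$, $L \in \mathbb{R}^{r \times k}$, $R \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$

With $\lambda_j$ the eigenvalues of the row-row covariance F and $\zeta_j$ the eigenvalues of the column-column covariance G:

$\min J_1 = \sum_{i=1}^{n} \|A_i - L M_i\|^2 = \sum_{j=k+1}^{r} \lambda_j$

$\min J_2 = \sum_{i=1}^{n} \|A_i - M_i R^T\|^2 = \sum_{j=k+1}^{c} \zeta_j$

$\min J_3 = \sum_{i=1}^{n} \|A_i - L M_i R^T\|^2 \le \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$

$\min J_4 = \sum_{i=1}^{n} \|A_i - L M_i L^T\|^2 \le 2 \sum_{j=k+1}^{r} \lambda_j$

SVD: $\min \|X - U \Sigma V^T\|^2 = \sum_{i=k+1}^{p} \sigma_i^2$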
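The 2D-SVD construction and its error expressions can be checked numerically. The sketch below assumes NumPy and a random stack of matrices with the mean matrix removed; sizes and names are illustrative. It builds F and G, takes their leading k eigenvectors as U and V, and compares the one-sided errors with the eigenvalue tails, and the two-sided error with the sum of both tails.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r, c, k = 30, 6, 5, 3                       # toy sizes (assumed)
A = rng.standard_normal((n, r, c))             # stack of n matrices A_i (r x c)
A -= A.mean(axis=0)                            # assume the mean matrix is zero

At = A.transpose(0, 2, 1)
F = (A @ At).sum(axis=0)                       # row-row cov:  sum_i A_i A_i^T   (r x r)
G = (At @ A).sum(axis=0)                       # col-col cov:  sum_i A_i^T A_i   (c x c)

lam, U_all = np.linalg.eigh(F)                 # eigh returns ascending eigenvalues
zeta, V_all = np.linalg.eigh(G)
lam, U_all = lam[::-1], U_all[:, ::-1]         # reorder to descending
zeta, V_all = zeta[::-1], V_all[:, ::-1]
U, V = U_all[:, :k], V_all[:, :k]

M = U.T @ A @ V                                # M_i = U^T A_i V, shape (n, k, k)
A_hat = U @ M @ V.T                            # U M_i V^T, shape (n, r, c)

J1 = np.sum((A - U @ U.T @ A) ** 2)            # left-only representation
J2 = np.sum((A - A @ V @ V.T) ** 2)            # right-only representation
J3 = np.sum((A - A_hat) ** 2)                  # two-sided representation
```

Storage for the two-sided form is $n k^2$ numbers for the $M_i$ plus $(r + c) k$ for U and V.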


Temperature maps (January over 100 years)

Reconstruction errors: SVD/2DSVD = 1.1

Storage: SVD/2DSVD = 8


Reconstructed image

SVD (K=15), storage 160560

2DSVD (K=15), storage 93060

[Figure: reconstructed images, SVD vs. 2DSVD]


2D-SVD Summary

2DSVD is an extension of standard SVD

Provides optimal solution for 4 representations for 2D images/maps

Substantial improvements in storage, computation, quality of reconstruction

Captures 2D characteristics


Part 1.C. K-means Clustering ⇔ Principal Component Analysis

(Equivalence between PCA and K-means)


K-means clustering

Also called "isodata", "vector quantization"

Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)

Computationally efficient (order-mN)

Widely used in practice
  Benchmark to evaluate other algorithms

Given n points in m dimensions: $X = (x_1, x_2, \ldots, x_n)$

K-means objective: $\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - c_k\|^2$


PCA is equivalent to K-means

The continuous optimal solution for cluster indicators in K-means clustering is given by principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.




A simple illustration


DNA Gene Expression File for Leukemia

Using $v_1$, tissue samples are separated into 2 clusters, with 3 errors

Do one more K-means, reduce to 1 error


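Multi-way K-means Clustering

Unsigned cluster membership indicators $h_1, \ldots, h_K$:

$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}$, with columns corresponding to clusters C1, C2, C3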


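Multi-way K-means Clustering

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j$

(Unsigned) cluster indicators $H = (h_1, \ldots, h_K)$, normalized so that $h_k(i) = 1/n_k^{1/2}$ if $i \in C_k$ and 0 otherwise:

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k = \sum_i x_i^2 - \mathrm{Tr}(H_k^T X^T X H_k)$

Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$

Regularized relaxation: transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_k$ via an orthogonal matrix T

$(q_1, \ldots, q_k) = (h_1, \ldots, h_k) T$, $Q_k = H_k T$, $q_1 = e / n^{1/2}$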
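The trace form of the K-means objective is easy to verify numerically. The following sketch assumes NumPy, random data with points as columns, and an arbitrary fixed cluster assignment (no K-means iterations are run). It compares the direct sum of squared distances to centroids with $\sum_i x_i^2 - \mathrm{Tr}(H^T X^T X H)$ for normalized indicators, and checks the redundancy relation $\sum_k n_k^{1/2} h_k = e$.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, K = 3, 12, 3                               # toy sizes (assumed)
X = rng.standard_normal((m, n))                  # columns are data points
labels = np.arange(n) % K                        # arbitrary assignment, no empty cluster

# Direct objective: sum_k sum_{i in C_k} ||x_i - c_k||^2
J_direct = 0.0
for k in range(K):
    Xk = X[:, labels == k]
    J_direct += np.sum((Xk - Xk.mean(axis=1, keepdims=True)) ** 2)

# Normalized indicators: h_k(i) = 1/sqrt(n_k) if i in C_k, else 0
H = np.zeros((n, K))
H[np.arange(n), labels] = 1.0
n_k = H.sum(axis=0)                              # cluster sizes
H /= np.sqrt(n_k)

# Trace form: J_K = sum_i ||x_i||^2 - Tr(H^T X^T X H)
J_trace = np.sum(X ** 2) - np.trace(H.T @ X.T @ X @ H)

# Redundancy: sum_k sqrt(n_k) h_k equals the all-ones vector e
redundancy = H @ np.sqrt(n_k)
```

Since the first term does not depend on the clustering, minimizing $J_K$ is the same as maximizing $\mathrm{Tr}(H^T X^T X H)$ over valid indicator matrices.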


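Consistency: 2-way and K-way approaches

Cluster indicators: $h_1 = (1 \cdots 1, 0 \cdots 0)^T / n_1^{1/2}$, $h_2 = (0 \cdots 0, 1 \cdots 1)^T / n_2^{1/2}$

Orthogonal transform: $T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}$

T transforms $(h_1, h_2)$ to $(q_1, q_2)$:

$q_1 = (1 \cdots 1)^T / n^{1/2}$

$q_2 = (a, \cdots, a, -b, \cdots, -b)^T$, $a = \sqrt{\dfrac{n_2}{n\, n_1}}$, $b = \sqrt{\dfrac{n_1}{n\, n_2}}$

Recover the original 2-way cluster indicator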


Test of Lower bounds of K-means clustering

Relative gap: $|J_{opt} - J_{LB}| / J_{opt}$

Lower bound is within 0.6-1.5% of the optimal value


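Cluster Subspace (spanned by K centroids) = PCA Subspace

Given a data point x, $P = \sum_k c_k c_k^T$ projects x into the cluster subspace

Centroid is given by $c_k = \sum_i h_k(i) x_i = X h_k$

$P = \sum_k c_k c_k^T = X \left( \sum_k h_k h_k^T \right) X^T \approx X \left( \sum_k v_k v_k^T \right) X^T = \sum_k \lambda_k u_k u_k^T$ (the middle step replaces the indicators $h_k$ by their continuous solution $v_k$)

$P_{K\text{-means}} \approx \sum_k \lambda_k u_k u_k^T \Leftrightarrow \sum_k u_k u_k^T \equiv P_{PCA}$

PCA automatically projects into the cluster subspace

PCA is an unsupervised version of LDA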


Effectiveness of PCA Dimension Reduction

[Figure]


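Kernel K-means Clustering

Mapping: $x_i \to \phi(x_i)$

Kernel K-means objective:

$\min J_K^{\phi} = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(c_k)\|^2 = \sum_i |\phi(x_i)|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$

where $\phi(c_k)$ denotes the centroid of cluster k in feature space.

Kernel K-means:

$\max J_K^{\phi} = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$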
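The kernel-only form of the objective can be checked with a kernel whose feature map is known explicitly. The sketch below assumes NumPy, 2-D inputs, the homogeneous quadratic kernel $\langle x, y \rangle^2$ with feature map $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$, and an arbitrary fixed cluster assignment; these are illustrative choices, not taken from the slides. Centroids are computed in feature space.

```python
import numpy as np

def phi(X):
    """Explicit feature map of the kernel <x, y>^2 for 2-D inputs (columns of X)."""
    x1, x2 = X
    return np.stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(9)
n, K = 10, 2
X = rng.standard_normal((2, n))                  # columns are points
labels = np.arange(n) % K                        # arbitrary fixed assignment

Kmat = (X.T @ X) ** 2                            # K_ij = <phi(x_i), phi(x_j)>
Phi = phi(X)                                     # 3 x n matrix of mapped points

# Objective computed directly in feature space (centroid = feature-space mean)
J_feature = 0.0
for k in range(K):
    Pk = Phi[:, labels == k]
    J_feature += np.sum((Pk - Pk.mean(axis=1, keepdims=True)) ** 2)

# Same objective using only the kernel matrix:
# sum_i K_ii - sum_k (1/n_k) sum_{i,j in C_k} K_ij
within = 0.0
for k in range(K):
    idx = np.flatnonzero(labels == k)
    within += Kmat[np.ix_(idx, idx)].sum() / idx.size
J_kernel = np.trace(Kmat) - within
```

Only `Kmat` is needed for `J_kernel`, which is why the method applies to kernels whose feature map is never formed explicitly.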


Kernel K-means clustering is equivalent to Kernel PCA

The continuous optimal solution for cluster indicators is given by Kernel PCA components

The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace
