PCA Tutor1
PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding
Principal Component Analysis and Matrix Factorizations for Learning
Chris Ding, Lawrence Berkeley National Laboratory
Supported by Office of Science, U.S. Dept. of Energy
-
Many unsupervised learning methods are closely related in a simple way:
Spectral Clustering
NMF
K-means clustering
PCA
Indicator Matrix Quadratic Clustering
Semi-supervised classification
Semi-supervised clustering
Outlier detection
-
Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Widely used in a large number of different fields
Most widely known as PCA (multivariate statistics)
SVD is the theoretical basis for PCA
-
Brief history
PCA: Draw a plane closest to data points (Pearson, 1901); Retain most variance (Hotelling, 1933)
SVD: Low-rank approximation (Eckart-Young, 1936); Practical application / efficient computation (Golub-Kahan, 1965)
Many generalizations
-
PCA and SVD
Data: n points in p-dim: $X = (x_1, x_2, \ldots, x_n)$
Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$
Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$
Principal directions: $u_k$ (principal axis, subspace)
Principal components: $v_k$ (projection on the subspace)
Underlying basis: SVD $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
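The relation between the SVD of $X$ and the eigen-decompositions of $XX^T$ and $X^TX$ can be checked numerically. The following is a minimal sketch, assuming NumPy, random toy data, and mean-centering as a preprocessing step (none of which appear on the slide); it uses $\lambda_k = \sigma_k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 40                            # illustrative sizes: p dimensions, n points
X = rng.standard_normal((p, n))         # columns are the data points x_1..x_n
X = X - X.mean(axis=1, keepdims=True)   # center each variable (assumed preprocessing)

# Thin SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

C = X @ X.T     # covariance (up to a 1/n factor), p x p
K = X.T @ X     # Gram (kernel) matrix, n x n

# Eigenvalues of C, sorted in decreasing order, equal the squared singular values
evals_C = np.linalg.eigvalsh(C)[::-1]

# Columns of U are eigenvectors of C; rows of Vt are eigenvectors of K
cov_residual = np.linalg.norm(C @ U - U * s**2)
gram_residual = np.linalg.norm(K @ Vt.T - Vt.T * s**2)
```

Both residuals should be at round-off level, which is the sense in which the SVD is the "underlying basis" of PCA.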
-
Further Developments
SVD/PCA
Principal Curves
Independent Component Analysis
Sparse SVD/PCA (many approaches)
Mixture of Probabilistic PCA
Generalization to exponential family, max-margin
Connection to K-means clustering
Kernel (inner-product): Kernel PCA
-
Methods of PCA Utilization
$X = (x_1, x_2, \ldots, x_n)$, $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
Principal components (uncorrelated random variables): $u_k = u_k(1) X_1 + \cdots + u_k(d) X_d$
Dimension reduction (projection to low-dim subspace): $\tilde{X} = U^T X$, $U = (u_1, \ldots, u_k)$
Sphering the data (transform data to N(0,1)): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
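The two uses listed above, projection to a low-dimensional subspace and sphering, can be sketched as follows. This is a minimal illustration assuming NumPy, synthetic correlated data, and the unnormalized convention $C = XX^T$ used on these slides; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 4, 500, 2
A = rng.standard_normal((p, p))
X = A @ rng.standard_normal((p, n))      # correlated toy data, columns are points
X = X - X.mean(axis=1, keepdims=True)    # center (assumed preprocessing)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Dimension reduction: project onto the k leading principal directions
X_low = U[:, :k].T @ X                   # k x n
low_cov = X_low @ X_low.T                # diagonal: the components are uncorrelated

# Sphering: X_sph = C^{-1/2} X = U Sigma^{-1} U^T X
X_sph = U @ np.diag(1.0 / s) @ U.T @ X
sph_cov = X_sph @ X_sph.T                # identity in the C = X X^T convention
```

With a $1/n$ factor in the covariance, the sphered data would need an extra factor of $\sqrt{n}$ to have unit variance.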
-
Applications of PCA/SVD
Most popular in multivariate statistics
Image processing, signal processing
Physics: principal axis, diagonalization of 2nd-order tensor (mass)
Climate: Empirical Orthogonal Functions (EOF)
Kalman filter. Reduced order analysis: $s^{(t+1)} = A s^{(t)} + E$, $P^{(t+1)} = A P^{(t)} A^T$
-
Applications of PCA/SVD
PCA/SVD is as widely used as Fast Fourier Transforms
Both are spectral expansions
FFT is more on Partial Differential Equations
PCA/SVD is more on discrete (data) analysis
PCA/SVD surpass FFT as computational sciences further advance
PCA/SVD
Select combination of variables
Dimension reduction
An image has $10^4$ pixels. True dimension is 20!
-
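PCA is a Matrix Factorization (spectral/eigen decomposition)
Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$
Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$
Underlying basis: SVD $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
Principal directions: $U = (u_1, u_2, \ldots, u_k)$
Principal components: $V = (v_1, v_2, \ldots, v_k)$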
-
From PCA to spectral clustering using generalized eigenvectors
Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$
In Kernel PCA we compute eigenvector: $W v = \lambda v$
Generalized Eigenvector: $W q = \lambda D q$, $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$
This leads to Spectral Clustering!
-
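Scaled PCA ⇒ Spectral Clustering
PCA: $W = \sum_k v_k \lambda_k v_k^T$
Scaled PCA: $\tilde{W} = D^{-1/2} W D^{-1/2}$, $\tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2}$
$W = D^{1/2} \tilde{W} D^{1/2} = D \left( \sum_k q_k \lambda_k q_k^T \right) D$
Scaled principal component: $q_k = D^{-1/2} v_k$, where $v_k$ and $\lambda_k$ are here the eigenvectors and eigenvalues of $\tilde{W}$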
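The link between scaled PCA and the generalized eigenproblem $Wq = \lambda Dq$ can be verified on a small example. The sketch below assumes NumPy and a random symmetric nonnegative similarity matrix as toy data; it is an illustration of the identities on this slide, not the tutorial's own code.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
B = rng.random((n, n))
W = (B + B.T) / 2                        # toy symmetric nonnegative similarity matrix
d = W.sum(axis=1)                        # degrees d_i = sum_j w_ij

# Scaled matrix W_tilde = D^{-1/2} W D^{-1/2}
W_tilde = W / np.sqrt(np.outer(d, d))
lam, V = np.linalg.eigh(W_tilde)         # eigenvalues in ascending order

# Scaled principal components q_k = D^{-1/2} v_k
Q = V / np.sqrt(d)[:, None]

# Each q_k solves the generalized eigenproblem W q = lambda D q
gen_residual = np.linalg.norm(W @ Q - (d[:, None] * Q) * lam)

# Reconstruction W = D (sum_k q_k lambda_k q_k^T) D
W_rebuilt = d[:, None] * ((Q * lam) @ Q.T) * d[None, :]
```

The largest eigenvalue of `W_tilde` is 1 and its scaled component is a constant vector, which carries no clustering information.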
-
Scaled PCA on a Rectangle Matrix ⇒ Correspondence Analysis
Re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, $\tilde{p}_{ij} = p_{ij} / (p_{i.}\, p_{.j})^{1/2}$
Apply SVD on $\tilde{P}$
Subtract trivial component: $P - r c^T / p_{..} = D_r \left( \sum_{k \ge 1} f_k \lambda_k g_k^T \right) D_c$
$r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$
$f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are scaled row and column principal components (standard coordinates in CA)
(Zha, et al, CIKM 2001, Ding et al, PKDD2002)
-
Nonnegative Matrix Factorization
Data Matrix: n points in p-dim: $X = (x_1, x_2, \ldots, x_n)$; $x_i$ is an image, document, webpage, etc.
Decomposition (low-rank approximation): $X \approx F G^T$
Nonnegative Matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$
$F = (f_1, f_2, \ldots, f_k)$, $G = (g_1, g_2, \ldots, g_k)$
-
Solving NMF with multiplicative updating
$J = \|X - F G^T\|^2$, $F \ge 0$, $G \ge 0$
Fix F, solve for G; Fix G, solve for F
Lee & Seung (2000) propose
$G_{jk} \leftarrow G_{jk} \dfrac{(X^T F)_{jk}}{(G F^T F)_{jk}}$
$F_{ik} \leftarrow F_{ik} \dfrac{(X G)_{ik}}{(F G^T G)_{ik}}$
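The alternating multiplicative updates above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `nmf_multiplicative`, the random initialization, the iteration count, and the small `eps` guard against division by zero are assumptions added here, not part of the slide.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-12, seed=0):
    """Hypothetical helper: minimize ||X - F G^T||^2 with F, G >= 0."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + 0.1         # positive random start (assumption)
    G = rng.random((n, k)) + 0.1
    history = []
    for _ in range(n_iter):
        # Fix F, update G:  G_jk <- G_jk (X^T F)_jk / (G F^T F)_jk
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
        # Fix G, update F:  F_ik <- F_ik (X G)_ik / (F G^T G)_ik
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        history.append(np.linalg.norm(X - F @ G.T) ** 2)
    return F, G, history

rng = np.random.default_rng(3)
X = rng.random((8, 3)) @ rng.random((3, 20))   # nonnegative rank-3 toy data
F, G, history = nmf_multiplicative(X, k=3)
```

Because every factor in the update is nonnegative, `F` and `G` stay nonnegative without any explicit projection step.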
-
Matrix Factorization Summary
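Symmetric (kernel matrix, graph):
PCA: $W = V \Lambda V^T$
Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D$
NMF: $W \approx Q Q^T$
Rectangle Matrix (contingency table, bipartite graph):
PCA: $X = U \Sigma V^T$
Scaled PCA: $X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Lambda G^T D_c$
NMF: $X \approx F G^T$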
-
Indicator Matrix Quadratic Clustering
Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$
Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T H = I$, $H \ge 0$
K-means: $W = X^T X$; Kernel K-means: $W = (\langle \phi(x_i), \phi(x_j) \rangle)$
Spectral clustering (normalized cut)
-
Indicator Matrix Quadratic Clustering
Additional features:
Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$
Semi-supervised clustering, with (A) must-link and (B) cannot-link constraints: $\max_H \mathrm{Tr}(H^T W H + H^T A H - H^T B H)$
Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in $H$
Nonnegative Lagrangian Relaxation: $H_{ik} \leftarrow H_{ik} \dfrac{(W H + C/2)_{ik}}{(H \alpha)_{ik}}$, $\alpha = H^T W H + H^T C$
-
Tutorial Outline
PCA
Recent developments on PCA/SVD
Equivalence to K-means clustering
Scaled PCA
Laplacian matrix
Spectral clustering
Spectral ordering
Nonnegative Matrix Factorization
Equivalence to K-means clustering
Holistic vs. Parts-based
Indicator Matrix Quadratic Clustering
Use Nonnegative Lagrangian Relaxation
Includes K-means and Spectral Clustering
Semi-supervised classification
Semi-supervised clustering
Outlier detection
-
Part 1.B. Recent Developments on PCA and SVD
Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD: Semi-discrete, truncation, L1 constraint, Direct sparsification
Column Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering
-
PCA and SVD
Data Matrix: $X = (x_1, x_2, \ldots, x_n)$
Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$
Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$
Principal directions: $u_k$ (principal axis, subspace)
Principal components: $v_k$ (projection on the subspace)
Underlying basis: SVD $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$
-
Kernel PCA
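Kernel: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, with $x_i \to \phi(x_i)$
PCA component $v$; feature extraction: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$
Indefinite Kernels
Generalization to graphs with nonnegative weights
(Schölkopf, Smola, Müller, 1996)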
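The feature-extraction formula $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$ needs only kernel evaluations. The following is a minimal sketch assuming NumPy and an RBF kernel with an arbitrary `gamma`; the helper names `rbf_kernel` and `extract` are hypothetical, and centering in feature space is omitted for brevity.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Hypothetical helper: RBF kernel between rows of A (n x p) and B (m x p)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 2))         # toy data, rows are points here
K = rbf_kernel(X, X)

lam, V = np.linalg.eigh(K)
lam, V = lam[::-1], V[:, ::-1]           # sort eigenpairs in decreasing order
k = 3

# Expansion coefficients: v = sum_i alpha_i phi(x_i), scaled so that ||v|| = 1
alpha = V[:, :k] / np.sqrt(lam[:k])

def extract(x_new):
    """Project phi(x_new) on the k leading kernel PCA components."""
    return rbf_kernel(np.atleast_2d(x_new), X) @ alpha

features_train = K @ alpha               # features of the training points
```

For training points the extracted features reduce to the eigenvectors of `K` scaled by the square roots of the eigenvalues.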
-
Mixture of PCA
Data has local structures.
Global PCA on all data is not useful
Clustering PCA (Hinton et al):
Use clustering to partition data into clusters
Perform PCA in each cluster
No explicit generative model
Probabilistic PCA (Tipping & Bishop):
Latent variables
Generative model (Gaussian)
Mixture of Gaussians ⇒ mixture of PCA
Adding Markov dynamics for latent variables (Linear Gaussian Models)
-
Probabilistic PCA
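Linear Gaussian Model: $x_i = W s_i + \mu + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$
Latent variables: $S = (s_1, \ldots, s_n)$
Gaussian prior: $P(s) \sim N(s_0, \sigma_s^2 I)$
$x \sim N(W s_0 + \mu, \sigma^2 I + \sigma_s^2 W W^T)$
(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)
Linear Gaussian Model with Markov dynamics: $s_{i+1} = A s_i + \eta$, $x_i = W s_i + \epsilon$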
-
Sparse PCA
Compute a factorization $X \approx U V^T$
U or V is sparse, or both are sparse
Why sparse?
Variable selection (sparse U)
When n >> d
Storage saving
Other new reasons?
L1 and L2 constraints
-
Sparse PCA: Truncation and Discretization
$X \approx U \Sigma V^T$, $U = (u_1 \cdots u_k)$, $V = (v_1 \cdots v_k)$
Sparsified SVD
Compute $\{u_k, v_k\}$ one at a time, truncate those entries below a threshold.
Recursively compute all pairs using deflation $X \leftarrow X - \sigma u v^T$. (Zhang, Zha, Simon, 2002)
Semi-discrete decomposition
U, V only contain {-1, 0, 1}
Iterative algorithm to compute U, V using deflation (Kolda & O'Leary, 1999)
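The truncate-then-deflate idea can be illustrated with a short sketch. This is a simplified stand-in, assuming NumPy, a fixed hard threshold, and a least-squares choice of $\sigma$; it is not the algorithm of Zhang, Zha and Simon, and the function name `sparsified_svd` is hypothetical.

```python
import numpy as np

def sparsified_svd(X, k, threshold):
    """Hypothetical helper: rank-one pairs with small entries zeroed, plus deflation."""
    R = np.array(X, dtype=float)             # residual matrix, deflated step by step
    us, sigmas, vs = [], [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        u, v = U[:, 0].copy(), Vt[0].copy()
        u[np.abs(u) < threshold] = 0.0       # truncate entries below the threshold
        v[np.abs(v) < threshold] = 0.0
        if not u.any() or not v.any():
            break                            # nothing left above the threshold
        u /= np.linalg.norm(u)
        v /= np.linalg.norm(v)
        sigma = u @ R @ v                    # best scale for this sparse pair
        R -= sigma * np.outer(u, v)          # deflation: X <- X - sigma u v^T
        us.append(u)
        sigmas.append(sigma)
        vs.append(v)
    return np.array(us).T, np.array(sigmas), np.array(vs).T, R

rng = np.random.default_rng(10)
X = rng.standard_normal((10, 8))
U_s, sig, V_s, R = sparsified_svd(X, k=3, threshold=0.2)
```

With this choice of `sigma`, each deflation step cannot increase the Frobenius norm of the residual, so the sparse factors trade some accuracy for sparsity in a controlled way.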
-
Sparse PCA: L1 constraint
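LASSO (Tibshirani, 1996): $\min \|y - X^T \beta\|^2$, $\|\beta\|_1 \le t$
SCoTLASS (Jolliffe & Uddin, 2003): $\max u^T (X X^T) u$, $\|u\|_1 \le t$, $u^T u_h = 0$
Least Angle Regression (Efron, et al 2004)
Sparse PCA (Zou, Hastie, Tibshirani, 2004):
$\min_{\alpha, \beta} \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, $\alpha^T \alpha = I$
$v_j = \beta_j / \|\beta_j\|$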
-
Sparse PCA: Direct Sparsification
Sparse SVD with explicit sparsification (Zhang, Zha, Simon 2003)
rank-one approximation: $\min_{u,v} \|X - d\, u v^T\|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
Minimize a bound
deflation
Direct sparse PCA, on covariance matrix S (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004)
$\max_u u^T S u = \max \mathrm{Tr}(S u u^T) = \max \mathrm{Tr}(S U)$
s.t. $\mathrm{Tr}(U) = 1$, $\mathrm{nnz}(U) \le k^2$, $U \succeq 0$, $\mathrm{rank}(U) = 1$
-
Sparse PCA Summary
Many different approaches
Truncation, discretization
L1 Constraint
Direct sparsification
Other approaches
Sparse Matrix factorization in general
L1 constraint
Many questions
Orthogonality
Unique solution, global solution
-
PCA: Further Generalizations
Generalization to Exponential Family (Collins, Dasgupta, Schapire, 2001)
Maximum Margin Factorization (Srebro, Rennie, Jaakkola, 2004)
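Collaborative filtering
Input Y is binary
$X = U V^T$, $\|X\|_\Sigma = \frac{1}{2} \left( \|U\|_{Fro}^2 + \|V\|_{Fro}^2 \right)$
Hard margin: $Y_{ia} X_{ia} \ge 1$, $\forall\, ia \in S$
Soft margin: $\min \|X\|_\Sigma + c \sum_{ia \in S} \max(0, 1 - Y_{ia} X_{ia})$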
-
Column Partitioned Matrix Factorizations
Column Partitioned Data Matrix:
$X = (x_1, \ldots, x_n) = (\underbrace{x_1 \cdots x_{n_1}}_{n_1}, \underbrace{x_{n_1+1} \cdots x_{n_1+n_2}}_{n_2}, \ldots, \underbrace{\cdots x_n}_{n_k})$, $n_1 + \cdots + n_k = n$
Partitions are generated by clustering
Centroid matrix $U = (u_1 \cdots u_k)$
$u_k$ is centroid
Fix U, compute V: $\min \|X - U V^T\|_F^2$, $V = X^T U (U^T U)^{-1}$
Represent each partition by a SVD.
Pick leading Us to form $U = (U_1, \ldots, U_k) = (\underbrace{u_1^{(1)} \cdots u_{l_1}^{(1)}}_{l_1}, \ldots, \underbrace{u_1^{(k)} \cdots u_{l_k}^{(k)}}_{l_k})$
Fix U, compute V
Several other variations
(Zhang & Zha, 2001)
(Castelli, Thomasian & Li 2003)
(Park, Jeon & Rosen, 2003)
(Dhillon & Modha, 2001)
(Zeimpekis & Gallopoulos, 2004)
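The centroid variant above ("fix U, compute V") is a plain least-squares problem. The sketch below assumes NumPy, synthetic data, and a column partition given in advance by a hypothetical `labels` array (in practice it would come from a clustering step); it compares the centroid-based error with the optimal rank-k SVD error.

```python
import numpy as np

rng = np.random.default_rng(5)
p, k = 6, 3
labels = np.repeat(np.arange(k), 15)             # assumed partition of n = 45 columns
centers = rng.standard_normal((p, k)) * 5
X = centers[:, labels] + rng.standard_normal((p, labels.size))

# Centroid matrix U = (u_1 ... u_k): one centroid per partition
U = np.stack([X[:, labels == j].mean(axis=1) for j in range(k)], axis=1)

# Fix U, compute V = X^T U (U^T U)^{-1}, the minimizer of ||X - U V^T||_F^2
V = np.linalg.solve(U.T @ U, U.T @ X).T          # n x k
err_centroid = np.linalg.norm(X - U @ V.T) ** 2

# Reference: the best possible rank-k error, from the truncated SVD
s = np.linalg.svd(X, compute_uv=False)
err_svd = np.sum(s[k:] ** 2)
```

The centroid-based error can never be smaller than `err_svd`, but the factor `U` is cheap to compute and directly interpretable.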
-
Two-dimensional SVD
Large number of data objects are 2-D: images, maps
Standard method:
convert (re-order) each image as a 1D vector
collect all 1D vectors into a single (big) matrix
apply SVD on the big matrix
2D-SVD is developed for 2D objects
Extension of standard SVD
Keeping the 2D characteristics
Improves quality of low-dimensional approximation
Reduces computation, storage
-
[Figure: a 2D image flattened into a 1D pixel vector]
Linearize a 2D object into 1D object
-
SVD and 2D-SVD
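SVD: $X = (x_1, x_2, \ldots, x_n)$
Eigenvectors of $X X^T$ and $X^T X$
$X = U \Sigma V^T$, $\Sigma = U^T X V$
2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$
Eigenvectors of
$F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$ (row-row covariance)
$G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$ (column-column cov)
$A_i = U M_i V^T$, $M_i = U^T A_i V$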
-
2D-SVD
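$\{A\} = \{A_1, A_2, \ldots, A_n\}$, assume $\bar{A} = 0$
row-row cov: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$
col-col cov: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$
$U = (u_1, u_2, \ldots, u_k)$, $V = (v_1, v_2, \ldots, v_k)$
$A_i \approx U M_i V^T$, $i = 1, \ldots, n$, with $M_i = U^T A_i V$
Bilinear subspace: $A_i \in R^{r \times c}$, $U \in R^{r \times k}$, $V \in R^{c \times k}$, $M_i \in R^{k \times k}$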
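The 2D-SVD construction can be sketched directly from the two covariance matrices. This is a minimal illustration assuming NumPy and random matrices as stand-ins for images; the helper name `top_eigvecs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, r, c, k = 20, 8, 6, 3
A = rng.standard_normal((n, r, c))       # n toy "images" of size r x c
A = A - A.mean(axis=0)                   # enforce the assumption of zero mean

# Row-row and column-column covariance matrices
F = np.einsum('irc,isc->rs', A, A)       # sum_i A_i A_i^T   (r x r)
G = np.einsum('irc,ird->cd', A, A)       # sum_i A_i^T A_i   (c x c)

def top_eigvecs(S, k):
    """Hypothetical helper: k leading eigenvectors of a symmetric matrix."""
    w, Q = np.linalg.eigh(S)
    return Q[:, ::-1][:, :k]

U = top_eigvecs(F, k)                    # r x k
V = top_eigvecs(G, k)                    # c x k

# M_i = U^T A_i V  and the reconstruction U M_i V^T
M = np.einsum('rk,irc,cl->ikl', U, A, V)
A_hat = np.einsum('rk,ikl,cl->irc', U, M, V)
err = np.sum((A - A_hat) ** 2)

# Sum of discarded eigenvalues of F and G bounds the reconstruction error
bound = np.linalg.eigvalsh(F)[::-1][k:].sum() + np.linalg.eigvalsh(G)[::-1][k:].sum()
```

Each image is stored as a small k x k matrix `M[i]` plus the shared factors `U` and `V`, which is where the storage saving comes from.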
-
2D-SVD Error Analysis
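SVD: $\min \|X - U \Sigma V^T\|^2 = \sum_{i=k+1}^{p} \sigma_i^2$
$A_i \approx L M_i R^T$, $A_i \in R^{r \times c}$, $L \in R^{r \times k}$, $R \in R^{c \times k}$, $M_i \in R^{k \times k}$
$J_1 = \min \sum_{i=1}^{n} \|A_i - L M_i\|^2 = \sum_{j=k+1}^{r} \lambda_j$
$J_2 = \min \sum_{i=1}^{n} \|A_i - M_i R^T\|^2 = \sum_{j=k+1}^{c} \zeta_j$
$J_3 = \min \sum_{i=1}^{n} \|A_i - L M_i R^T\|^2 \le \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$
$J_4 = \min \sum_{i=1}^{n} \|A_i - L M_i L^T\|^2 \le 2 \sum_{j=k+1}^{r} \lambda_j$
Here $\lambda_j$ and $\zeta_j$ are the eigenvalues of the row-row and column-column covariance matrices, and the size of $M_i$ in $J_1$ and $J_2$ is adjusted to match the one-sided factorization.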
-
Temperature maps (January over 100 years)
Reconstruction errors: SVD/2DSVD = 1.1
Storage: SVD/2DSVD = 8
-
Reconstructed image
SVD (K=15), storage 160560
2DSVD (K=15), storage 93060
[Figure panels: SVD, 2DSVD]
-
2D-SVD Summary
2DSVD is an extension of standard SVD
Provides optimal solution for 4 representations for 2D images/maps
Substantial improvements in storage, computation, quality of reconstruction
Captures 2D characteristics
-
Part 1.C. K-means Clustering ⇔ Principal Component Analysis
(Equivalence between PCA and K-means)
-
K-means clustering
Also called "isodata", "vector quantization"
Developed in 1960s (Lloyd, MacQueen, Hartigan, etc.)
Computationally efficient (order-mN)
Widely used in practice
Benchmark to evaluate other algorithms
Given n points in m-dim: $X = (x_1, x_2, \ldots, x_n)$
K-means objective: $\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - c_k\|^2$
-
PCA is equivalent to K-means
Continuous optimal solutions for cluster indicators in K-means clustering are given by principal components.
The subspace spanned by the K cluster centroids is given by the PCA subspace.
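For two clusters, the statement above suggests a simple recipe: take the sign of the first principal component as the cluster indicator, then optionally refine with K-means. The sketch below assumes NumPy, two well-separated synthetic Gaussian clusters, and a hand-written Lloyd iteration (the helper name `lloyd` is hypothetical); it is an illustration on toy data, not a proof.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n_half = 5, 50
offset = np.zeros(m)
offset[0] = 6.0                               # clusters separated along the first axis
X = np.hstack([rng.standard_normal((m, n_half)) + offset[:, None],
               rng.standard_normal((m, n_half)) - offset[:, None]])
truth = np.repeat([0, 1], n_half)

# First principal component v_1 of the centered data (columns are points)
Xc = X - X.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]
pca_labels = (v1 < 0).astype(int)             # sign of v_1 as 2-way cluster indicator

def lloyd(X, labels, n_iter=20):
    """Hypothetical helper: plain Lloyd iterations for K = 2, started from given labels."""
    labels = labels.copy()
    for _ in range(n_iter):
        cents = np.stack([X[:, labels == j].mean(axis=1) for j in (0, 1)], axis=1)
        d2 = ((X[:, :, None] - cents[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
    return labels

kmeans_labels = lloyd(X, pca_labels)

# Agreement with the generating labels, up to swapping the two cluster names
pca_agreement = max(np.mean(pca_labels == truth), np.mean(pca_labels != truth))
kmeans_agreement = max(np.mean(kmeans_labels == truth), np.mean(kmeans_labels != truth))
```

On data with this much separation, the PCA-based split and the K-means refinement are expected to recover nearly the same partition.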
-
A simple illustration
-
DNA Gene Expression File for Leukemia
Using $v_1$, tissue samples are separated into 2 clusters, with 3 errors
Do one more K-means, reduce to 1 error
-
Multi-way K-means Clustering
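Unsigned cluster membership indicators $h_1, \ldots, h_K$:
$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, with columns corresponding to clusters C1, C2, C3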
-
Multi-way K-means Clustering
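$J_K = \sum_i \|x_i\|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j$
(Unsigned) cluster indicators $H = (h_1, \ldots, h_K)$, normalized so that $h_k^T h_k = 1$
$J_K = \sum_i \|x_i\|^2 - \sum_{k=1}^{K} h_k^T X^T X h_k = \sum_i \|x_i\|^2 - \mathrm{Tr}(H_k^T X^T X H_k)$
Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$
Regularized Relaxation
Transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_k$ via orthogonal matrix $T$:
$(q_1, \ldots, q_k) = (h_1, \ldots, h_k)\, T$, $Q_k = H_k T$, $q_1 = e / n^{1/2}$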
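The trace form of the K-means objective is easy to confirm numerically. The following minimal sketch assumes NumPy, random data, and an arbitrary partition; it builds the normalized indicator matrix $H$ and compares both expressions for $J_K$.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, K = 3, 12, 3
X = rng.standard_normal((m, n))               # columns are the data points
labels = np.arange(n) % K                     # an arbitrary partition (toy example)

# Direct K-means objective: squared distances to the cluster centroids
J_direct = sum(
    np.sum((X[:, labels == k] - X[:, labels == k].mean(axis=1, keepdims=True)) ** 2)
    for k in range(K)
)

# Normalized indicators: h_k has entries 1/sqrt(n_k) on cluster k, 0 elsewhere
H = np.zeros((n, K))
H[np.arange(n), labels] = 1.0
H /= np.sqrt(H.sum(axis=0))

# Trace form: J_K = sum_i ||x_i||^2 - Tr(H^T X^T X H)
J_trace = np.sum(X ** 2) - np.trace(H.T @ X.T @ X @ H)

# Redundancy: sum_k sqrt(n_k) h_k = e (the all-ones vector)
e_rebuilt = H @ np.sqrt(np.bincount(labels, minlength=K))
```

Since the first term does not depend on the partition, minimizing $J_K$ amounts to maximizing the trace term, which is what the relaxation works with.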
-
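Consistency: 2-way and K-way approaches
Orthogonal Transform: $T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}$
$T$ transforms $(h_1, h_2)$ to $(q_1, q_2)$:
$h_1 = (1 \cdots 1, 0 \cdots 0)^T / n_1^{1/2}$, $h_2 = (0 \cdots 0, 1 \cdots 1)^T / n_2^{1/2}$
$q_1 = (1 \cdots 1)^T / n^{1/2}$, $q_2 = (a, \ldots, a, -b, \ldots, -b)^T$
$a = \sqrt{n_2 / (n_1 n)}$, $b = \sqrt{n_1 / (n_2 n)}$
Recover the original 2-way cluster indicator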
-
Test of Lower bounds of K-means clustering
Relative gap: $|J_{opt} - J_{LB}| / J_{opt}$
Lower bound is within 0.6-1.5% of the optimal value
-
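Cluster Subspace (spanned by K centroids) = PCA Subspace
Given a data point x, $P = \sum_k c_k c_k^T$ projects x into the cluster subspace
Centroid is given by $c_k = \sum_i h_k(i)\, x_i = X h_k$
$P = \sum_k c_k c_k^T = X \left( \sum_k h_k h_k^T \right) X^T \approx X \left( \sum_k v_k v_k^T \right) X^T = \sum_k \lambda_k u_k u_k^T$, where $h_k$ is replaced by its continuous solution $v_k$
$P_{K\text{-means}} \Leftrightarrow \sum_k \lambda_k u_k u_k^T \Leftrightarrow \sum_k u_k u_k^T = P_{PCA}$
PCA automatically projects into the cluster subspace
PCA is unsupervised version of LDA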
-
Effectiveness of PCA Dimension Reduction
-
Kernel K-means Clustering
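$x_i \to \phi(x_i)$
Kernel K-means objective: $\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(c_k)\|^2$
$= \sum_i |\phi(x_i)|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$
The first term is constant, so this is equivalent to
Kernel K-means: $\max J = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$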
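Because the objective depends on the data only through inner products, it can be evaluated from the kernel matrix alone. The sketch below assumes NumPy and an arbitrary toy partition; the function name `kernel_kmeans_objective` is hypothetical. A linear kernel is used so that the result can be compared with the ordinary K-means objective.

```python
import numpy as np

def kernel_kmeans_objective(Kmat, labels):
    """Hypothetical helper: return (min-form objective, max-form objective)."""
    total = np.trace(Kmat)                    # sum_i <phi(x_i), phi(x_i)>
    within = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        # (1/n_k) * sum over i, j in C_k of K_ij
        within += Kmat[np.ix_(idx, idx)].sum() / idx.size
    return total - within, within

rng = np.random.default_rng(9)
X = rng.standard_normal((4, 15))              # columns are the data points
labels = np.arange(15) % 3                    # arbitrary partition (toy example)

K_lin = X.T @ X                               # linear kernel: phi(x) = x
J_min, J_max = kernel_kmeans_objective(K_lin, labels)

# Ordinary K-means objective for the same partition, for comparison
J_direct = sum(
    np.sum((X[:, labels == k] - X[:, labels == k].mean(axis=1, keepdims=True)) ** 2)
    for k in range(3)
)
```

Any other positive semidefinite kernel matrix can be passed in place of `K_lin` without changing the function.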
-
Kernel K-means clustering is equivalent to Kernel PCA
Continuous optimal solutions for cluster indicators are given by Kernel PCA components
The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace