PCA and LDA
Nuno Vasconcelos, ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.
  [figure: 1D subspace in 2D; 2D subspace in 3D]
• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φi are the eigenvectors of Σ
• principal lengths λi are the eigenvalues of Σ
  [figure: ellipse in the (y1, y2) plane with axes φ1, φ2 and lengths λ1, λ2]
by computing the eigenvalues we know if the data is flat:
  λ1 >> λ2: flat        λ1 = λ2: not flat
  [figure: an elongated ellipse vs. a circular one in the (y1, y2) plane]

Principal component analysis (learning)
  [figures not recovered]

Principal component analysis
there is an alternative manner to compute the principal components, based on the singular value decomposition
SVD:
• any real n x m matrix (n > m) can be decomposed as

    A = M Π N^T

• where M is an n x m column-orthonormal matrix of left singular vectors (columns of M)
• Π an m x m diagonal matrix of singular values
• N^T an m x m row-orthonormal matrix of right singular vectors (columns of N)

    M^T M = I,    N^T N = I

PCA by SVD
to relate this to PCA, we consider the data matrix

    X = [ x1 ... xn ]     (one example per column)

the sample mean is

    µ = (1/n) Σi xi = (1/n) X 1

where 1 is the n x 1 vector of ones

PCA by SVD
and we can center the data by subtracting the mean from each column of X
this is the centered data matrix

    Xc = [ x1 - µ  ...  xn - µ ]
       = X - µ 1^T
       = X - (1/n) X 1 1^T
       = X ( I - (1/n) 1 1^T )

PCA by SVD
the sample covariance is

    Σ = (1/n) Σi (xi - µ)(xi - µ)^T = (1/n) Σi xi^c (xi^c)^T

where xi^c is the ith column of Xc
this can be written as

    Σ = (1/n) [ x1^c ... xn^c ] [ (x1^c)^T ; ... ; (xn^c)^T ] = (1/n) Xc Xc^T

PCA by SVD
the matrix

    Xc^T = [ (x1^c)^T ; ... ; (xn^c)^T ]

is real n x d. Assuming n > d, it has the SVD decomposition

    Xc^T = M Π N^T,    M^T M = N^T N = I

and

    Σ = (1/n) Xc Xc^T = (1/n) N Π M^T M Π N^T = (1/n) N Π² N^T

PCA by SVD

    Σ = N ( Π² / n ) N^T

noting that N is d x d and orthonormal, and Π² diagonal, shows that this is just the eigenvalue decomposition of Σ
it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

    λi = πi² / n

this gives an alternative algorithm for PCA

PCA by SVD
computation of PCA by SVD
given X with one example per column:
• 1) create the centered data matrix

    Xc^T = ( I - (1/n) 1 1^T ) X^T

• 2) compute its SVD

    Xc^T = M Π N^T

• 3) the principal components are the columns of N, the eigenvalues are

    λi = πi² / n

Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do

Example
consider a problem with
• two n-D Gaussian classes with covariance Σ = σ²I, σ² = 10, i.e. X ~ N(µi, 10 I)
• we add an extra variable which is the class label itself: X' = [X, i]
• assuming that PY(0) = PY(1) = 0.5,

    E[Y] = 0.5 × 0 + 0.5 × 1 = 0.5
    var[Y] = 0.5 (0 - 0.5)² + 0.5 (1 - 0.5)² = 0.25 < 10

• dimension n+1 has the smallest variance and is the first to be discarded!

Example
this is
• a very contrived example
• but it shows that PCA can throw away all the discriminant info
does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, and probably not much worse than the one with the best 5 features
is there a rule of thumb for finding the number of PCA components?

Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio rk as a function of k

    rk = ( Σ_{i=1}^{k} λi² ) / ( Σ_{i=1}^{n} λi² )

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
  [figure: plot of rk as a function of k]

Fisher's linear discriminant
what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions
one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes
  [figure: a bad projection vs. a good projection of two classes onto a line]

Linear discriminant analysis
we have two classes such that

    E_{X|Y}[X | i] = µi
    E_{X|Y}[(X - µi)(X - µi)^T | i] = Σi

and want to find the line

    z = w^T x

that best separates them
one possibility would be to maximize

    ( E_{Z|Y}[Z | 1] - E_{Z|Y}[Z | 0] )² = ( E_{X|Y}[w^T x | 1] - E_{X|Y}[w^T x | 0] )² = ( w^T (µ1 - µ0) )²

Linear discriminant analysis
however, this

    ( w^T (µ1 - µ0) )²

can be made arbitrarily large by simply scaling w
we are only interested in the direction, not the magnitude
we need some type of normalization
Fisher suggested

    max  (between class scatter) / (within class scatter)

i.e.

    max_w  ( E_{Z|Y}[Z | 1] - E_{Z|Y}[Z | 0] )² / ( var[Z | 1] + var[Z | 0] )

Linear discriminant analysis
we have already seen that

    ( E_{Z|Y}[Z | 1] - E_{Z|Y}[Z | 0] )² = ( w^T (µ1 - µ0) )² = w^T (µ1 - µ0)(µ1 - µ0)^T w

also

    var[Z | i] = E_{Z|Y}[ ( z - E_{Z|Y}[Z | i] )² | i ]
               = E_{X|Y}[ ( w^T x - w^T µi )² | i ]
               = E_{X|Y}[ w^T (x - µi)(x - µi)^T w | i ]
               = w^T Σi w

Linear discriminant analysis
and

    J(w) = ( E_{Z|Y}[Z | 1] - E_{Z|Y}[Z | 0] )² / ( var[Z | 1] + var[Z | 0] )
         = w^T (µ1 - µ0)(µ1 - µ0)^T w / ( w^T (Σ1 + Σ0) w )

which can be written as

    J(w) = ( w^T S_B w ) / ( w^T S_W w )

with

    S_B = (µ1 - µ0)(µ1 - µ0)^T    (between class scatter)
    S_W = Σ1 + Σ0                 (within class scatter)

Linear discriminant analysis
maximizing the ratio

    J(w) = ( w^T S_B w ) / ( w^T S_W w )

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

    max_w  w^T S_B w    subject to    w^T S_W w = K

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

    L = w^T S_B w - λ ( w^T S_W w - K )

• and maximize with respect to both w and λ

Linear discriminant analysis
setting the gradient of

    L = w^T ( S_B - λ S_W ) w + λ K

with respect to w to zero, we get

    ∇_w L = 2 ( S_B - λ S_W ) w = 0

or

    S_B w = λ S_W w

this is a generalized eigenvalue problem
the solution is easy when S_W^{-1} = ( Σ1 + Σ0 )^{-1} exists

Linear discriminant analysis
in this case

    S_W^{-1} S_B w = λ w

and, using the definition of S_B,

    S_W^{-1} (µ1 - µ0)(µ1 - µ0)^T w = λ w

noting that (µ1 - µ0)^T w = α is a scalar, this can be written as

    S_W^{-1} (µ1 - µ0) = (λ/α) w

and since we don't care about the magnitude of w,

    w* = S_W^{-1} (µ1 - µ0) = ( Σ1 + Σ0 )^{-1} (µ1 - µ0)

Linear discriminant analysis
note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σi = Σ, the BDR boundary is the plane of normal

    w = Σ^{-1} ( µi - µj )

• if Σ1 = Σ0, this is also the LDA solution
  [figure: two Gaussian classes of equal covariance Σ, with means µi, µj, boundary point x0 and normal w]

Linear discriminant analysis
this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after the approximation of the data by two Gaussians with equal covariance
