PCA and LDA

Nuno Vasconcelos, ECE Department, UCSD

Page 1:

PCA and LDA

Nuno Vasconcelos, ECE Department, UCSD

Page 2:

Principal component analysis

basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.

[figure: a 1D subspace in 2D and a 2D subspace in 3D]

• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Page 3:

Principal component analysis

if y is Gaussian with covariance Σ, the equiprobability contours are ellipses whose
• principal components φi are the eigenvectors of Σ
• principal lengths λi are the eigenvalues of Σ

[figure: ellipse in the (y1, y2) plane, with principal axes φ1, φ2 and principal lengths λ1, λ2]

by computing the eigenvalues we know if the data is flat:
• λ1 >> λ2: flat
• λ1 = λ2: not flat

[figure: two equiprobability contours in the (y1, y2) plane, one elongated (λ1 >> λ2) and one circular (λ1 = λ2)]
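As a concrete illustration (not part of the original slides), the following minimal numpy sketch fits a Gaussian to synthetic 2D data and inspects the eigenvalues of the sample covariance to judge flatness; the data-generating parameters are arbitrary.

```python
import numpy as np

# illustrative sketch: fit a Gaussian to 2D data and check whether it is "flat"
rng = np.random.default_rng(0)
t = rng.normal(size=500)
# synthetic data lying close to a 1D subspace of R^2 (arbitrary parameters)
y = np.stack([t, 0.2 * t + 0.05 * rng.normal(size=500)], axis=1)

Sigma = np.cov(y, rowvar=False)          # sample covariance
lam, phi = np.linalg.eigh(Sigma)         # eigenvalues (ascending) and eigenvectors
lam, phi = lam[::-1], phi[:, ::-1]       # reorder so that lambda1 >= lambda2
print(lam[0] / lam[1])                   # lambda1 >> lambda2  =>  the data is flat
```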

Page 4:

Principal component analysis (learning)

Page 5:

Principal component analysis

Page 6:

Principal component analysis

there is an alternative way to compute the principal components, based on the singular value decomposition (SVD):
• any real n x m matrix A (n > m) can be decomposed as

$$A = M \Pi N^T$$

• where M is an n x m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix of right singular vectors (the columns of N)

$$M^T M = N^T N = I$$
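A quick numerical check of these SVD properties (an illustrative sketch, not from the slides), using numpy:

```python
import numpy as np

# check the stated SVD properties on a random real n x m matrix (n > m)
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))

M, s, Nt = np.linalg.svd(A, full_matrices=False)   # economy SVD: M is 8 x 3
Pi = np.diag(s)                                    # 3 x 3 diagonal of singular values

print(np.allclose(A, M @ Pi @ Nt))                 # A = M Pi N^T
print(np.allclose(M.T @ M, np.eye(3)))             # M^T M = I
print(np.allclose(Nt @ Nt.T, np.eye(3)))           # N^T N = I
```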

Page 7:

PCA by SVD

to relate this to PCA, we consider the data matrix

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

the sample mean is

$$\mu = \frac{1}{n}\sum_i x_i = \frac{1}{n} X \mathbf{1}$$

where $\mathbf{1}$ is the n-vector of ones

Page 8:

PCA by SVD

we can center the data by subtracting the mean from each column of X; this gives the centered data matrix

$$X_c = \begin{bmatrix} | & & | \\ x_1 - \mu & \cdots & x_n - \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)$$
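The centering identity above is easy to verify numerically; a small sketch (illustrative only), with one example per column:

```python
import numpy as np

# verify that X (I - (1/n) 1 1^T) subtracts the sample mean from every column
rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(d, n))                          # one example per column

ones = np.ones((n, 1))
Xc_matrix = X @ (np.eye(n) - ones @ ones.T / n)      # X (I - (1/n) 1 1^T)
Xc_direct = X - X.mean(axis=1, keepdims=True)        # subtract the mean directly
print(np.allclose(Xc_matrix, Xc_direct))             # True: the two forms agree
```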

Page 9:

PCA by SVD

the sample covariance is

$$\Sigma = \frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n}\sum_i x_i^c \left(x_i^c\right)^T$$

where $x_i^c$ is the ith column of $X_c$

this can be written as

$$\Sigma = \frac{1}{n}\begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix}\begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$
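A one-line check (illustrative sketch, not from the slides) that the outer-product form Σ = (1/n) X_c X_c^T matches numpy's biased covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
X = rng.normal(size=(d, n))                        # one example per column

Xc = X - X.mean(axis=1, keepdims=True)             # centered data matrix
Sigma = Xc @ Xc.T / n                              # (1/n) Xc Xc^T
print(np.allclose(Sigma, np.cov(X, bias=True)))    # True (rows of X are the variables)
```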

Page 10:

PCA by SVD

the matrix

$$X_c^T = \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix}$$

is real n x d. Assuming n > d, it has the SVD decomposition

$$X_c^T = M \Pi N^T, \qquad M^T M = N^T N = I$$

and

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi M^T M \Pi N^T = \frac{1}{n} N \Pi^2 N^T$$

Page 11:

PCA by SVD

$$\Sigma = N \left(\frac{1}{n}\Pi^2\right) N^T$$

noting that N is d x d and orthonormal, and Π² is diagonal, this is just the eigenvalue decomposition of Σ

it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

$$\lambda_i = \frac{1}{n}\pi_i^2$$

this gives an alternative algorithm for PCA

Page 12:

PCA by SVD

computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are

$$\lambda_i = \frac{1}{n}\pi_i^2$$
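A minimal sketch of this three-step recipe (illustrative only; the helper name pca_by_svd is not from the slides), with a sanity check against the eigendecomposition of the sample covariance:

```python
import numpy as np

def pca_by_svd(X):
    """PCA by SVD. X has one example per column (d x n)."""
    d, n = X.shape
    ones = np.ones((n, 1))
    Xc_T = (np.eye(n) - ones @ ones.T / n) @ X.T          # 1) centered data matrix X_c^T
    M, pi, Nt = np.linalg.svd(Xc_T, full_matrices=False)  # 2) X_c^T = M Pi N^T
    return Nt.T, pi**2 / n                                # 3) components N, eigenvalues pi_i^2 / n

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
N, lam = pca_by_svd(X)
lam_eig = np.sort(np.linalg.eigvalsh(np.cov(X, bias=True)))[::-1]
print(np.allclose(lam, lam_eig))                          # eigenvalues of Sigma recovered
```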

Page 13:

Limitations of PCA

PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out

it is not hard to construct examples where PCA is the worst possible thing we could do

Page 14:

Example

consider a problem with
• two n-D Gaussian classes with covariance Σ = σ²I, σ² = 10, i.e. X ~ N(µi, 10 I)
• we add an extra variable which is the class label itself: X' = [X, i]
• assuming that PY(0) = PY(1) = 0.5,

$$E[Y] = 0.5 \times 0 + 0.5 \times 1 = 0.5$$

$$\operatorname{var}[Y] = 0.5\,(0 - 0.5)^2 + 0.5\,(1 - 0.5)^2 = 0.25 < 10$$

• dimension n+1 has the smallest variance and is the first to be discarded!
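The sketch below reproduces this effect numerically (illustrative only; the class means 0 and 1 are assumptions, the slide leaves them unspecified):

```python
import numpy as np

# two 2-D Gaussian classes with covariance 10*I, plus the class label as an extra feature
rng = np.random.default_rng(0)
n, d = 1000, 2
X0 = rng.normal(loc=0.0, scale=np.sqrt(10), size=(n, d))    # class 0 (assumed mean 0)
X1 = rng.normal(loc=1.0, scale=np.sqrt(10), size=(n, d))    # class 1 (assumed mean 1)
Xp = np.vstack([np.hstack([X0, np.zeros((n, 1))]),
                np.hstack([X1, np.ones((n, 1))])])          # X' = [X, y]

print(np.var(Xp, axis=0))   # roughly [10.x, 10.x, 0.25]: the label coordinate has by far
                            # the smallest variance, so PCA discards the one perfectly
                            # discriminant dimension first
```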

Page 15:

Example

this is
• a very contrived example
• but it shows that PCA can throw away all the discriminant info

does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, and probably not much worse than the one with the best 5 features

is there a rule of thumb for finding the number of PCA components?

Page 16:

Principal component analysis

a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio rk as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

[figure: plot of rk versus k for an example dataset]
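A short sketch of this rule of thumb (illustrative; the helper name and the example eigenvalues are made up):

```python
import numpy as np

def n_components_for(eigenvalues, p=0.7):
    """Smallest k such that the top-k eigenvalues explain a fraction p of the variability."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # lambda_1 >= lambda_2 >= ...
    r = np.cumsum(lam) / lam.sum()                              # r_k for k = 1..n
    return int(np.searchsorted(r, p) + 1)

print(n_components_for([5.0, 2.0, 1.0, 0.5, 0.3, 0.2], p=0.7))  # -> 2
```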

Page 17:

Fisher's linear discriminant

what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions

one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

[figure: two classes projected onto two lines, a bad projection and a good projection]

Page 18:

Linear discriminant analysis

we have two classes such that

$$E_{X|Y}[X \mid i] = \mu_i, \qquad E_{X|Y}\!\left[(X - \mu_i)(X - \mu_i)^T \mid i\right] = \Sigma_i$$

and want to find the line

$$z = w^T x$$

that best separates them

one possibility would be to maximize

$$\left(E_{Z|Y}[Z \mid 1] - E_{Z|Y}[Z \mid 0]\right)^2 = \left(E_{X|Y}[w^T x \mid 1] - E_{X|Y}[w^T x \mid 0]\right)^2 = \left(w^T(\mu_1 - \mu_0)\right)^2$$

Page 19:

Linear discriminant analysis

however, this

$$\left(w^T(\mu_1 - \mu_0)\right)^2$$

can be made arbitrarily large by simply scaling w

we are only interested in the direction, not the magnitude, so we need some type of normalization

Fisher suggested maximizing the ratio of the between-class scatter to the within-class scatter

$$\max_w \frac{\left(E_{Z|Y}[Z \mid 1] - E_{Z|Y}[Z \mid 0]\right)^2}{\operatorname{var}[Z \mid 1] + \operatorname{var}[Z \mid 0]}$$

Page 20:

Linear discriminant analysis

we have already seen that

$$\left(E_{Z|Y}[Z \mid 1] - E_{Z|Y}[Z \mid 0]\right)^2 = \left(w^T(\mu_1 - \mu_0)\right)^2 = w^T(\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w$$

also

$$\begin{aligned}
\operatorname{var}[Z \mid i] &= E_{Z|Y}\!\left[\left(z - E_{Z|Y}[Z \mid i]\right)^2 \mid i\right] \\
&= E_{X|Y}\!\left[\left(w^T x - w^T \mu_i\right)^2 \mid i\right] \\
&= E_{X|Y}\!\left[w^T (x - \mu_i)(x - \mu_i)^T w \mid i\right] \\
&= w^T \Sigma_i w
\end{aligned}$$

Page 21:

Linear discriminant analysis

and

$$J(w) = \frac{\left(E_{Z|Y}[Z \mid 1] - E_{Z|Y}[Z \mid 0]\right)^2}{\operatorname{var}[Z \mid 1] + \operatorname{var}[Z \mid 0]} = \frac{w^T(\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w}{w^T\left(\Sigma_1 + \Sigma_0\right) w}$$

which can be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

where

$$S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T \quad \text{(between-class scatter)}$$

$$S_W = \Sigma_1 + \Sigma_0 \quad \text{(within-class scatter)}$$

Page 22:

Linear discriminant analysis

maximizing the ratio

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

$$L = w^T S_B w - \lambda \left(w^T S_W w - K\right)$$

• and maximize with respect to both w and λ

Page 23:

Linear discriminant analysis

setting the gradient of

$$L = w^T \left(S_B - \lambda S_W\right) w + \lambda K$$

with respect to w to zero we get

$$\nabla_w L = 2\left(S_B - \lambda S_W\right) w = 0$$

or

$$S_B w = \lambda S_W w$$

this is a generalized eigenvalue problem

the solution is easy when $S_W^{-1} = \left(\Sigma_1 + \Sigma_0\right)^{-1}$ exists

Page 24:

Linear discriminant analysis

in this case

$$S_W^{-1} S_B w = \lambda w$$

and, using the definition of $S_B$,

$$S_W^{-1} (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w = \lambda w$$

noting that $(\mu_1 - \mu_0)^T w = \alpha$ is a scalar, this can be written as

$$S_W^{-1} (\mu_1 - \mu_0) = \frac{\lambda}{\alpha} w$$

and since we don't care about the magnitude of w

$$w^* = S_W^{-1}(\mu_1 - \mu_0) = \left(\Sigma_1 + \Sigma_0\right)^{-1}(\mu_1 - \mu_0)$$
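A minimal sketch of this closed-form solution estimated from data (illustrative only; the function name and the synthetic classes are assumptions, not from the slides):

```python
import numpy as np

def lda_direction(X0, X1):
    """Closed-form LDA direction w* = (Sigma_1 + Sigma_0)^{-1} (mu_1 - mu_0).
    X0, X1: one example per row for class 0 and class 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = np.cov(X0, rowvar=False, bias=True) + np.cov(X1, rowvar=False, bias=True)
    w = np.linalg.solve(S_W, mu1 - mu0)          # solve S_W w = mu_1 - mu_0
    return w / np.linalg.norm(w)                 # only the direction matters

# usage on synthetic Gaussian classes with a shared covariance
rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0], [1.0, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], C, size=500)
X1 = rng.multivariate_normal([2.0, 1.0], C, size=500)
print(lda_direction(X0, X1))
```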

Page 25:

Linear discriminant analysis

note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σi = Σ, the BDR boundary is the plane of normal

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

• if Σ1 = Σ0, this is also the LDA solution

[figure: two Gaussian classes of equal covariance Σ, with means µi and µj and the decision plane through x0 with normal w]

Page 26:

Linear discriminant analysis

this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after the approximation of the data by two Gaussians with equal covariance
