Methods for Intelligent Systems Lecture Notes on Feature Projection
2009 - 2010
Department of Electronics and Information, Politecnico di Milano
Feature Projection
• Goal: reduce the number of features D to improve classification accuracy
• Use a linear combination of the original features
• Projects the data into a different space, in which the classes might be better separated
• The resulting features could lose their original meaning
$$\bar{x} = Hx, \qquad H : \mathbb{R}^D \rightarrow \mathbb{R}^M$$

$$H = \begin{bmatrix} h_{11} & h_{12} & \dots & h_{1D} \\ h_{21} & h_{22} & \dots & h_{2D} \\ & \dots & \\ h_{M1} & h_{M2} & \dots & h_{MD} \end{bmatrix}$$

[Block diagram: the input x (dimension D) is projected by H to x̄ (dimension M), which a classifier f(x̄) maps to the output y (dimension 1).]
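As a minimal sketch of the projection x̄ = Hx (the entries of H below are arbitrary placeholders, not from the notes; in practice H is produced by PCA or LDA):

```python
import numpy as np

# Minimal sketch of x_bar = H x. The entries of H are arbitrary
# placeholders; in practice H comes from PCA, LDA, etc.
D, M = 4, 2
H = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])   # H : R^D -> R^M, shape (M, D)

x = np.array([1.0, 3.0, 2.0, 4.0])     # one sample in R^D
x_bar = H @ x                          # projected sample in R^M
print(x_bar)                           # [2. 3.]
```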
Signal Classification vs Signal Representation
• Signal Representation (PCA)
  – Uses the information associated with the data distribution (i.e., mean and variance)
  – No relationship with the classification problem
  – Data should have similar variances
• Signal Classification (LDA)
  – Uses the information associated with discrimination capabilities (inter-class distance)
  – Tendency to over-fit the training data, with poor generalization abilities
Principal Component Analysis (PCA)
• Hp: the data follow a multi-dimensional Gaussian distribution
• Goal: find the principal components of the distribution, which account for the maximum variance of the data
• Covariance matrix
$$\Sigma_x = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2D} \\ & \dots & \\ \sigma_{D1} & \sigma_{D2} & \dots & \sigma_D^2 \end{bmatrix}, \qquad \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$
• Decomposition of $\Sigma_x$: solve $\Sigma_x \phi - \lambda \phi = 0$
  – The eigenvectors $\phi_i$ are the principal components of the distribution
  – The eigenvalues $\lambda_i$ are the variances of the data along the principal components
PCA and vectors
• Samples represented in a 2-D vector format: $x = x_1\phi_1 + x_2\phi_2$, where $\phi_1, \phi_2$ are the axis vectors
• Samples in a D-dimensional space: $x = \sum_{i=1}^{D} x_i\phi_i$
• Suppose we want to represent the data in an M-dimensional (M < D) space
  – Replace the discarded features with constant values $b_i$:
$$\hat{x}(M) = \sum_{i=1}^{M} x_i\phi_i + \sum_{i=M+1}^{D} b_i\phi_i$$
• We have an error
$$\Delta x(M) = x - \hat{x}(M) = \sum_{i=M+1}^{D} (x_i - b_i)\phi_i$$
• Mean squared error
$$\varepsilon^2(M) = E[\|\Delta x(M)\|^2] = \sum_{i=M+1}^{D} E[(x_i - b_i)^2]$$
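A small numpy sketch of this truncated reconstruction, using synthetic Gaussian data of our own choosing and the constants $b_i = E[x_i]$ (derived as optimal on the next slide):

```python
import numpy as np

# Sketch: reconstruct synthetic data from the first M principal components
# and measure the mean squared error (data and names are our own choice).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[3.0, 1.0, 0.0],
                             [1.0, 2.0, 0.0],
                             [0.0, 0.0, 0.1]], size=500)

mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2
Phi = eigvecs[:, :M]                       # first M principal components
X_hat = mu + (X - mu) @ Phi @ Phi.T        # x_hat(M): project, then lift back
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[M:].sum())              # MSE ~ sum of discarded eigenvalues
```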
PCA and mean squared error
• Find the $b_i$ that minimize $\varepsilon^2(M)$:
$$\frac{\partial}{\partial b_i} E[(x_i - b_i)^2] = -2(E[x_i] - b_i) = 0 \quad\Rightarrow\quad b_i = E[x_i]$$
• Substituting $b_i = E[x_i]$:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} E[(x_i - E[x_i])^2] = \sum_{i=M+1}^{D} E[(\phi_i^T x - E[\phi_i^T x])^2] = \sum_{i=M+1}^{D} \phi_i^T E[(x - E[x])(x - E[x])^T]\,\phi_i = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i$$
• Adding the orthonormality constraint $\phi_i^T \phi_i = 1$ via Lagrange multipliers $\lambda_i$:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i + \sum_{i=M+1}^{D} \lambda_i (1 - \phi_i^T \phi_i)$$
• Find the $\phi_i$ that minimize $\varepsilon^2(M)$:
$$\frac{\partial}{\partial \phi_i} \varepsilon^2(M) = 2(\Sigma_x \phi_i - \lambda_i \phi_i) = 0 \quad\Rightarrow\quad \Sigma_x \phi_i = \lambda_i \phi_i$$
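Substituting $\Sigma_x \phi_i = \lambda_i \phi_i$ and $\phi_i^T \phi_i = 1$ back into the error:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i = \sum_{i=M+1}^{D} \lambda_i$$
The error equals the sum of the discarded eigenvalues, so it is minimized by keeping the eigenvectors with the largest eigenvalues and discarding the rest.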
PCA step by step
1. Given the dataset x
2. Compute the covariance matrix $\Sigma_x$
3. Solve the characteristic equation $\Sigma_x \phi - \lambda \phi = 0$
4. Choose the first M eigenvectors, i.e. those corresponding to the largest eigenvalues
5. Project the data (see the sketch below):
$$H = [\phi_1 | \phi_2 | \dots | \phi_M]^T, \qquad \bar{x} = Hx$$
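A minimal numpy sketch of the five steps (function and variable names are ours; unlike the slides, the data are centered before projecting, which is standard practice):

```python
import numpy as np

def pca_project(X, M):
    """Project the rows of X (n samples x D features) onto the first
    M principal components. A minimal sketch of the five steps above."""
    mu = X.mean(axis=0)                       # step 1: dataset statistics
    Sigma = np.cov(X, rowvar=False)           # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1]         # step 4: sort by eigenvalue,
    H = eigvecs[:, order[:M]].T               #         keep the first M
    return (X - mu) @ H.T                     # step 5: project, x_bar = H x

X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)
print(pca_project(X, 1))                      # 1-D projection of the example data
```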
PCA Example
• Dataset: x = {(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)}, with mean µ = (5, 5)
• Covariance matrix
$$\Sigma_x = \begin{bmatrix} 7.1429 & 4.8571 \\ 4.8571 & 4.0000 \end{bmatrix}$$
• Decomposition of the covariance matrix: $\Sigma_x \phi - \lambda \phi = 0$
$$\begin{vmatrix} 7.1429 - \lambda & 4.8571 \\ 4.8571 & 4.0000 - \lambda \end{vmatrix} = 0 \quad\Rightarrow\quad \lambda_1 = 10.6764, \quad \lambda_2 = 0.4664$$
$$H = [\phi_1 | \phi_2] = \begin{bmatrix} -0.8086 & -0.5883 \\ -0.5883 & 0.8086 \end{bmatrix}$$
PCA Example (2)
• Projection of the data onto the principal components
• Ratio of component variance to total variance: $\lambda_i / \sum_{i=1}^{D} \lambda_i$
$$\frac{\lambda_1}{\sum_{i=1}^{D} \lambda_i} = 0.9581 \qquad \frac{\lambda_2}{\sum_{i=1}^{D} \lambda_i} = 0.0419$$
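The numbers in this example can be checked with a few lines of numpy (a verification sketch, not part of the original notes):

```python
import numpy as np

# Verify the worked example: covariance, eigenvalues, explained variance.
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

Sigma = np.cov(X, rowvar=False)       # [[7.1429, 4.8571], [4.8571, 4.0000]]
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals = eigvals[::-1]               # descending: [10.6764, 0.4664]
print(Sigma, eigvals, eigvals / eigvals.sum())   # ratios: [0.9581, 0.0419]
```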
PCA Notes
• PCA aims to decompose the covariance matrix $\Sigma_x$
• $\Sigma_x$ is estimated under the assumption of a Gaussian distribution
  – If the data are not normally distributed, PCA merely de-correlates the features
• PCA does not use class labels to project the data
• The projection depends only on the structure of the data
• Stretching one of the principal components does not change the distribution of the data
• There is no guarantee that the principal components best separate the classes
Fisher's Linear Discriminant Analysis (LDA)
• LDA is a linear projection
• The projection maximizes inter-class separability (among different classes) and minimizes intra-class scatter (within the same class)
• The projection is found by solving an optimization problem
  – Which measure of separability should be optimized?
Fisher's Linear Discriminant Analysis (LDA)
• A possible measure of separability: the sample mean
  – Sample mean for class $\omega_i$:
$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$
  – Projected mean:
$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} W^T x = W^T \mu_i$$
  – Measure of separation:
$$J(W) = |\tilde{\mu}_1 - \tilde{\mu}_2|$$
• The distance between the projected means alone is not enough: it does not take into account the standard deviation within the classes
Fisher's Linear Discriminant Analysis (LDA)
• Within-class scatter matrix
$$S_w = \sum_{\omega_i \in Y} \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$
• Between-class scatter matrix
$$S_b = \sum_{\omega_i \in Y} N_i (\mu - \mu_i)(\mu - \mu_i)^T$$
• The scatter of the projected samples of class $\omega_i$ is
$$\tilde{s}_i^2 = \sum_{x \in \omega_i} (W^T x - W^T \mu_i)^2 = \sum_{x \in \omega_i} W^T (x - \mu_i)(x - \mu_i)^T W = W^T S_{\omega_i} W$$
so that $\tilde{s}_1^2 + \tilde{s}_2^2 = W^T S_w W$; similarly,
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (W^T \mu_1 - W^T \mu_2)^2 = W^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T W = W^T S_b W$$
(for two classes, $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is proportional to the $S_b$ defined above, so the maximizer is unchanged)
• Measure of separation
$$J(W) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{W^T S_b W}{W^T S_w W}$$
  – Projected means are well separated
  – Projected variances are small
Fisher's Linear Discriminant Analysis (LDA)
• Find the W that maximizes J(W):
$$\frac{\partial}{\partial W} J(W) = \frac{\partial}{\partial W}\left[\frac{W^T S_b W}{W^T S_w W}\right] = 0$$
$$(W^T S_w W)\,\frac{\partial}{\partial W}(W^T S_b W) - (W^T S_b W)\,\frac{\partial}{\partial W}(W^T S_w W) = 0$$
$$(W^T S_w W)\, 2 S_b W - (W^T S_b W)\, 2 S_w W = 0$$
  – Dividing by $W^T S_w W$:
$$\frac{W^T S_w W}{W^T S_w W}\, S_b W - \frac{W^T S_b W}{W^T S_w W}\, S_w W = 0$$
$$S_b W - J\, S_w W = 0 \quad\Rightarrow\quad S_w^{-1} S_b W - J\, W = 0$$
  – Solving this generalized eigenvector problem gives the W that maximizes J
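For the two-class case the problem has a well-known closed-form solution, worth noting here: since $S_b W = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T W$ is always a vector in the direction of $\mu_1 - \mu_2$, the eigenvector equation reduces to
$$W^* \propto S_w^{-1}(\mu_1 - \mu_2)$$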
LDA step by step
1. Given the dataset x
2. Compute the class statistics:
$$S_{\omega_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$
3. Compute the within-class scatter matrix:
$$S_w = \sum_{i=1..|Y|} S_{\omega_i}$$
4. Compute the between-class scatter matrix:
$$S_b = \sum_{\omega_i \in Y} N_i (\mu - \mu_i)(\mu - \mu_i)^T$$
5. Solve the generalized eigenvalue problem (see the sketch below):
$$S_w^{-1} S_b W - J W = 0$$
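A minimal numpy sketch of these steps (the function name and structure are ours; only the first C−1 columns of the returned W are meaningful, as noted later):

```python
import numpy as np

def lda_project(X, y):
    """Fisher LDA, following the steps above. Returns the projection
    matrix W with columns sorted by decreasing eigenvalue."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                      # overall mean
    D = X.shape[1]
    Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)               # step 2: class statistics
        Sw += (Xc - mu_c).T @ (Xc - mu_c)    # step 3: within-class scatter
        d = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)            # step 4: between-class scatter
    # step 5: generalized eigenvalue problem Sw^-1 Sb W = J W
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order].real
```

The worked example on the next slide exercises exactly these steps.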
LDA example
1. Dataset:
   x = {(4,1), (2,4), (2,3), (3,6), (4,4), (9,10), (6,8), (9,5), (8,7), (10,8)}
   y = {1, 1, 1, 1, 1, 2, 2, 2, 2, 2}
2. Compute the class statistics $S_{\omega_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$
   – Class 1: $\mu_1 = [3, 3.6]$, $S_{\omega_1} = \begin{bmatrix} 4 & -2 \\ -2 & 13.2 \end{bmatrix}$
   – Class 2: $\mu_2 = [8.4, 7.6]$, $S_{\omega_2} = \begin{bmatrix} 9.2 & -0.2 \\ -0.2 & 13.2 \end{bmatrix}$
3. Compute the within-class scatter matrix:
$$S_w = S_{\omega_1} + S_{\omega_2} = \begin{bmatrix} 13.2 & -2.2 \\ -2.2 & 26.4 \end{bmatrix}$$
LDA example (2)
4. Compute the between-class scatter matrix $S_b = \sum_{\omega_i \in Y} N_i(\mu - \mu_i)(\mu - \mu_i)^T$:
$$S_b = \begin{bmatrix} 72.9 & 54 \\ 54 & 40 \end{bmatrix}$$
5. Solve the generalized eigenvalue problem $S_w^{-1} S_b W - JW = 0$:
$$|S_w^{-1} S_b - \lambda I| = 0 \quad\Rightarrow\quad \begin{vmatrix} 5.9462 - \lambda & 4.4046 \\ 2.5410 & 1.8822 - \lambda \end{vmatrix} = 0$$
$$\lambda_1 = 7.8284, \quad \lambda_2 = 0, \qquad W = \begin{bmatrix} 0.9196 & -0.5952 \\ 0.3930 & 0.8036 \end{bmatrix}$$
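Checking the example numerically (a verification sketch; variable names are ours):

```python
import numpy as np

# Verify the worked LDA example.
X = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4],
              [9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
y = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

mu = X.mean(axis=0)
Sw, Sb = np.zeros((2, 2)), np.zeros((2, 2))
for c in (1, 2):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    d = (mu_c - mu).reshape(-1, 1)
    Sb += len(Xc) * (d @ d.T)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
print(eigvals)   # ~ [7.8284, 0]
print(eigvecs)   # columns match W above, up to sign and ordering
```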
Notes on LDA
• Produces only C−1 projections (C = number of classes)
• Hp: unimodal distribution of the data within each class
  – If the data are highly nonlinear, the resulting projection is suboptimal
• Non-parametric Linear Discriminant Analysis removes the unimodal assumption
  – $S_b$ is computed with local information through a K-NN
  – This results in a full-rank $S_b$, and the projection is made over more than C−1 dimensions
• LDA fails when the discriminative information is contained in the variance rather than in the mean
  – Encode the variance as a new feature!
Other dimensionality reduction techniques
• Kernel PCA
• Independent Component Analysis (ICA)
• Multilayer Perceptron
• Self-organizing maps (SOMs)
• Sammon's map
• Support vector machines (SVM)
  – Margin maximization