Methods for Intelligent Systems Lecture Notes on Feature Projection
2009 - 2010
Department of Electronics and Information, Politecnico di Milano
Feature Projection
• Goal: reduce the number of features D to improve classification accuracy
• Use a linear combination of the original features
• Projects the data into a different space, in which the classes might be better separated
• The resulting features could lose their original meaning
$$\bar{x} = Hx, \qquad H : \mathbb{R}^D \rightarrow \mathbb{R}^M$$

$$H = \begin{bmatrix} h_{11} & h_{12} & \dots & h_{1D} \\ h_{21} & h_{22} & \dots & h_{2D} \\ & \dots & \\ h_{M1} & h_{M2} & \dots & h_{MD} \end{bmatrix}$$

[Block diagram: the input x (dimension D) is projected by H to x̄ (dimension M), which a classifier f(x̄) maps to the output y (dimension 1).]
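As a minimal sketch of the projection x̄ = Hx (the entries of H below are arbitrary placeholders, not from the notes; in practice H is produced by PCA or LDA):

```python
import numpy as np

# Minimal sketch of x_bar = H x. The entries of H are arbitrary
# placeholders; in practice H comes from PCA, LDA, etc.
D, M = 4, 2
H = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])   # H : R^D -> R^M, shape (M, D)

x = np.array([1.0, 3.0, 2.0, 4.0])     # one sample in R^D
x_bar = H @ x                          # projected sample in R^M
print(x_bar)                           # [2. 3.]
```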
Signal Classification vs Signal Representation
• Signal Representation (PCA)
  – Uses the information associated with the data distribution (i.e., mean and variance)
  – No relationship with the classification problem
  – Data should have similar variances
• Signal Classification (LDA)
  – Uses the information associated with discrimination capabilities (inter-class distance)
  – Tendency to over-fit the training data, with poor generalization abilities
Principal Component Analysis (PCA)
• Hp: the data follow a multi-dimensional Gaussian distribution
• Goal: find the principal components of the distribution, which account for the maximum variance of the data
• Covariance matrix
$$\Sigma_x = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2D} \\ & \dots & \\ \sigma_{D1} & \sigma_{D2} & \dots & \sigma_D^2 \end{bmatrix}, \qquad \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$
• Decomposition of $\Sigma_x$: solve $\Sigma_x \phi - \lambda \phi = 0$
  – The eigenvectors $\phi_i$ are the principal components of the distribution
  – The eigenvalues $\lambda_i$ are the variances of the data along the principal components
PCA and vectors
• Samples represented in a 2-D vector format: $x = x_1\phi_1 + x_2\phi_2$, where $\phi_1, \phi_2$ are the axis vectors
• Samples in a D-dimensional space: $x = \sum_{i=1}^{D} x_i\phi_i$
• Suppose we want to represent the data in an M-dimensional (M < D) space
  – Replace the discarded features with constant values $b_i$:
$$\hat{x}(M) = \sum_{i=1}^{M} x_i\phi_i + \sum_{i=M+1}^{D} b_i\phi_i$$
• We have an error
$$\Delta x(M) = x - \hat{x}(M) = \sum_{i=M+1}^{D} (x_i - b_i)\phi_i$$
• Mean squared error
$$\varepsilon^2(M) = E[\|\Delta x(M)\|^2] = \sum_{i=M+1}^{D} E[(x_i - b_i)^2]$$
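A small numpy sketch of this truncated reconstruction, using synthetic Gaussian data of our own choosing and the constants $b_i = E[x_i]$ (derived as optimal on the next slide):

```python
import numpy as np

# Sketch: reconstruct synthetic data from the first M principal components
# and measure the mean squared error (data and names are our own choice).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[3.0, 1.0, 0.0],
                             [1.0, 2.0, 0.0],
                             [0.0, 0.0, 0.1]], size=500)

mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2
Phi = eigvecs[:, :M]                       # first M principal components
X_hat = mu + (X - mu) @ Phi @ Phi.T        # x_hat(M): project, then lift back
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[M:].sum())              # MSE ~ sum of discarded eigenvalues
```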
PCA and mean squared error
• Find the $b_i$ that minimize $\varepsilon^2(M)$:
$$\frac{\partial}{\partial b_i} E[(x_i - b_i)^2] = -2(E[x_i] - b_i) = 0 \quad\Rightarrow\quad b_i = E[x_i]$$
• Substituting $b_i = E[x_i]$:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} E[(x_i - E[x_i])^2] = \sum_{i=M+1}^{D} E[(\phi_i^T x - E[\phi_i^T x])^2] = \sum_{i=M+1}^{D} \phi_i^T E[(x - E[x])(x - E[x])^T]\,\phi_i = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i$$
• Adding the orthonormality constraint $\phi_i^T \phi_i = 1$ via Lagrange multipliers $\lambda_i$:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i + \sum_{i=M+1}^{D} \lambda_i (1 - \phi_i^T \phi_i)$$
• Find the $\phi_i$ that minimize $\varepsilon^2(M)$:
$$\frac{\partial}{\partial \phi_i} \varepsilon^2(M) = 2(\Sigma_x \phi_i - \lambda_i \phi_i) = 0 \quad\Rightarrow\quad \Sigma_x \phi_i = \lambda_i \phi_i$$
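Substituting $\Sigma_x \phi_i = \lambda_i \phi_i$ and $\phi_i^T \phi_i = 1$ back into the error:
$$\varepsilon^2(M) = \sum_{i=M+1}^{D} \phi_i^T \Sigma_x \phi_i = \sum_{i=M+1}^{D} \lambda_i$$
The error equals the sum of the discarded eigenvalues, so it is minimized by keeping the eigenvectors with the largest eigenvalues and discarding the rest.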
PCA step by step
1. Given the dataset x
2. Compute the covariance matrix $\Sigma_x$
3. Solve the characteristic equation $\Sigma_x \phi - \lambda \phi = 0$
4. Choose the first M eigenvectors, i.e. those corresponding to the largest eigenvalues
5. Project the data (see the sketch below):
$$H = [\phi_1 | \phi_2 | \dots | \phi_M]^T, \qquad \bar{x} = Hx$$
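A minimal numpy sketch of the five steps (function and variable names are ours; unlike the slides, the data are centered before projecting, which is standard practice):

```python
import numpy as np

def pca_project(X, M):
    """Project the rows of X (n samples x D features) onto the first
    M principal components. A minimal sketch of the five steps above."""
    mu = X.mean(axis=0)                       # step 1: dataset statistics
    Sigma = np.cov(X, rowvar=False)           # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1]         # step 4: sort by eigenvalue,
    H = eigvecs[:, order[:M]].T               #         keep the first M
    return (X - mu) @ H.T                     # step 5: project, x_bar = H x

X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)
print(pca_project(X, 1))                      # 1-D projection of the example data
```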
PCA Example
• Dataset: x = {(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)}, with mean µ = (5, 5)
• Covariance matrix
$$\Sigma_x = \begin{bmatrix} 7.1429 & 4.8571 \\ 4.8571 & 4.0000 \end{bmatrix}$$
• Decomposition of the covariance matrix: $\Sigma_x \phi - \lambda \phi = 0$
$$\begin{vmatrix} 7.1429 - \lambda & 4.8571 \\ 4.8571 & 4.0000 - \lambda \end{vmatrix} = 0 \quad\Rightarrow\quad \lambda_1 = 10.6764, \quad \lambda_2 = 0.4664$$
$$H = [\phi_1 | \phi_2] = \begin{bmatrix} -0.8086 & -0.5883 \\ -0.5883 & 0.8086 \end{bmatrix}$$
PCA Example (2)
• Projection of the data onto the principal components
• Ratio of component variance to total variance: $\lambda_i / \sum_{i=1}^{D} \lambda_i$
$$\frac{\lambda_1}{\sum_{i=1}^{D} \lambda_i} = 0.9581 \qquad \frac{\lambda_2}{\sum_{i=1}^{D} \lambda_i} = 0.0419$$
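The numbers in this example can be checked with a few lines of numpy (a verification sketch, not part of the original notes):

```python
import numpy as np

# Verify the worked example: covariance, eigenvalues, explained variance.
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

Sigma = np.cov(X, rowvar=False)       # [[7.1429, 4.8571], [4.8571, 4.0000]]
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals = eigvals[::-1]               # descending: [10.6764, 0.4664]
print(Sigma, eigvals, eigvals / eigvals.sum())   # ratios: [0.9581, 0.0419]
```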
PCA Notes
• PCA aims to decompose the covariance matrix $\Sigma_x$
• $\Sigma_x$ is estimated under the assumption of a Gaussian distribution
  – If the data are not normally distributed, PCA merely de-correlates the features
• PCA does not use class labels to project the data
• The projection depends only on the structure of the data
• Stretching one of the principal components does not change the distribution of the data
• There is no guarantee that the principal components best separate the classes
Fisher's Linear Discriminant Analysis (LDA)
• LDA is a linear projection
• The projection maximizes inter-class separability (among different classes) and minimizes intra-class scatter (within the same class)
• The projection is found by solving an optimization problem
  – Which measure of separability should be optimized?
Fisher's Linear Discriminant Analysis (LDA)
• A possible measure of separability: the sample mean
  – Sample mean for class $\omega_i$:
$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$
  – Projected mean:
$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} W^T x = W^T \mu_i$$
  – Measure of separation:
$$J(W) = |\tilde{\mu}_1 - \tilde{\mu}_2|$$
• The distance between the projected means alone is not enough: it does not take into account the standard deviation within the classes
Fisher's Linear Discriminant Analysis (LDA)
• Within-class scatter matrix
$$S_w = \sum_{\omega_i \in Y} \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$
• Between-class scatter matrix
$$S_b = \sum_{\omega_i \in Y} N_i (\mu - \mu_i)(\mu - \mu_i)^T$$
• The scatter of the projected samples of class $\omega_i$ is
$$\tilde{s}_i^2 = \sum_{x \in \omega_i} (W^T x - W^T \mu_i)^2 = \sum_{x \in \omega_i} W^T (x - \mu_i)(x - \mu_i)^T W = W^T S_{\omega_i} W$$
so that $\tilde{s}_1^2 + \tilde{s}_2^2 = W^T S_w W$; similarly,
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (W^T \mu_1 - W^T \mu_2)^2 = W^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T W = W^T S_b W$$
(for two classes, $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is proportional to the $S_b$ defined above, so the maximizer is unchanged)
• Measure of separation
$$J(W) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{W^T S_b W}{W^T S_w W}$$
  – Projected means are well separated
  – Projected variances are small
Fisher's Linear Discriminant Analysis (LDA)
• Find the W that maximizes J(W):
$$\frac{\partial}{\partial W} J(W) = \frac{\partial}{\partial W}\left[\frac{W^T S_b W}{W^T S_w W}\right] = 0$$
$$(W^T S_w W)\,\frac{\partial}{\partial W}(W^T S_b W) - (W^T S_b W)\,\frac{\partial}{\partial W}(W^T S_w W) = 0$$
$$(W^T S_w W)\, 2 S_b W - (W^T S_b W)\, 2 S_w W = 0$$
  – Dividing by $W^T S_w W$:
$$\frac{W^T S_w W}{W^T S_w W}\, S_b W - \frac{W^T S_b W}{W^T S_w W}\, S_w W = 0$$
$$S_b W - J\, S_w W = 0 \quad\Rightarrow\quad S_w^{-1} S_b W - J\, W = 0$$
  – Solving this generalized eigenvector problem gives the W that maximizes J
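For the two-class case the problem has a well-known closed-form solution, worth noting here: since $S_b W = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T W$ is always a vector in the direction of $\mu_1 - \mu_2$, the eigenvector equation reduces to
$$W^* \propto S_w^{-1}(\mu_1 - \mu_2)$$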
LDA step by step
1. Given the dataset x
2. Compute the class statistics:
$$S_{\omega_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$
3. Compute the within-class scatter matrix:
$$S_w = \sum_{i=1..|Y|} S_{\omega_i}$$
4. Compute the between-class scatter matrix:
$$S_b = \sum_{\omega_i \in Y} N_i (\mu - \mu_i)(\mu - \mu_i)^T$$
5. Solve the generalized eigenvalue problem (see the sketch below):
$$S_w^{-1} S_b W - J W = 0$$
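A minimal numpy sketch of these steps (the function name and structure are ours; only the first C−1 columns of the returned W are meaningful, as noted later):

```python
import numpy as np

def lda_project(X, y):
    """Fisher LDA, following the steps above. Returns the projection
    matrix W with columns sorted by decreasing eigenvalue."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                      # overall mean
    D = X.shape[1]
    Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)               # step 2: class statistics
        Sw += (Xc - mu_c).T @ (Xc - mu_c)    # step 3: within-class scatter
        d = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)            # step 4: between-class scatter
    # step 5: generalized eigenvalue problem Sw^-1 Sb W = J W
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order].real
```

The worked example on the next slide exercises exactly these steps.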
LDA example
1. Dataset:
   x = {(4,1), (2,4), (2,3), (3,6), (4,4), (9,10), (6,8), (9,5), (8,7), (10,8)}
   y = {1, 1, 1, 1, 1, 2, 2, 2, 2, 2}
2. Compute the class statistics $S_{\omega_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$
   – Class 1: $\mu_1 = [3, 3.6]$, $S_{\omega_1} = \begin{bmatrix} 4 & -2 \\ -2 & 13.2 \end{bmatrix}$
   – Class 2: $\mu_2 = [8.4, 7.6]$, $S_{\omega_2} = \begin{bmatrix} 9.2 & -0.2 \\ -0.2 & 13.2 \end{bmatrix}$
3. Compute the within-class scatter matrix:
$$S_w = S_{\omega_1} + S_{\omega_2} = \begin{bmatrix} 13.2 & -2.2 \\ -2.2 & 26.4 \end{bmatrix}$$
LDA example (2)
4. Compute the between-class scatter matrix $S_b = \sum_{\omega_i \in Y} N_i(\mu - \mu_i)(\mu - \mu_i)^T$:
$$S_b = \begin{bmatrix} 72.9 & 54 \\ 54 & 40 \end{bmatrix}$$
5. Solve the generalized eigenvalue problem $S_w^{-1} S_b W - JW = 0$:
$$|S_w^{-1} S_b - \lambda I| = 0 \quad\Rightarrow\quad \begin{vmatrix} 5.9462 - \lambda & 4.4046 \\ 2.5410 & 1.8822 - \lambda \end{vmatrix} = 0$$
$$\lambda_1 = 7.8284, \quad \lambda_2 = 0, \qquad W = \begin{bmatrix} 0.9196 & -0.5952 \\ 0.3930 & 0.8036 \end{bmatrix}$$
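Checking the example numerically (a verification sketch; variable names are ours):

```python
import numpy as np

# Verify the worked LDA example.
X = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4],
              [9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
y = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

mu = X.mean(axis=0)
Sw, Sb = np.zeros((2, 2)), np.zeros((2, 2))
for c in (1, 2):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    d = (mu_c - mu).reshape(-1, 1)
    Sb += len(Xc) * (d @ d.T)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
print(eigvals)   # ~ [7.8284, 0]
print(eigvecs)   # columns match W above, up to sign and ordering
```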
Notes on LDA
• Produces only C−1 projections (C = number of classes)
• Hp: unimodal distribution of the data within each class
  – If the data are highly nonlinear, the resulting projection is suboptimal
• Non-parametric Linear Discriminant Analysis removes the unimodal assumption
  – $S_b$ is computed with local information through a K-NN
  – This results in a full-rank $S_b$, and the projection is made over more than C−1 dimensions
• LDA fails when the discriminative information is contained in the variance rather than in the mean
  – Encode the variance as a new feature!
Other dimensionality reduction techniques
• Kernel PCA
• Independent Component Analysis (ICA)
• Multilayer Perceptron
• Self-organizing maps (SOMs)
• Sammon's map
• Support vector machines (SVM)
  – Margin maximization