Elements of Pattern Recognition
CNS/EE-148 -- Lecture 5
M. Weber, P. Perona
What is Classification?
• We want to assign objects to classes based on a selection of attributes (features).
• Examples:
– (age, income) → {credit worthy, not credit worthy}
– (blood cell count, body temp) → {flu, hepatitis B, hepatitis C}
– (pixel vector) → {Bill Clinton, coffee cup}
• Feature vector can be continuous, discrete or mixed.
What is Classification?
• Want to find a function from measurements to class labels, i.e. a decision boundary:

$$c: \mathbb{R}^2 \to \{C_0, C_1, C_2, \dots\}$$
[Figure: space of feature vectors — clusters "Signal 1", "Signal 2", and "Noise" in the (x1, x2) feature plane]
• Statistical methods use pdf: p(C,x)
• Assume p(C,x) known for now
Some Terminology
• p(C) is called a prior or a priori probability
• p(x|C) is called a class-conditional density
or likelihood of C with respect to x
• p(C|x) is called a posterior or
a posteriori probability
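These three quantities are linked by Bayes' rule, which underlies everything that follows:

$$p(C \mid x) = \frac{p(x \mid C)\, p(C)}{p(x)}, \qquad p(x) = \sum_i p(x \mid C_i)\, p(C_i)$$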
Examples
• One measurement, symmetric cost, equal priors
[Figure: class-conditional densities p(x|C1) and p(x|C2) over x, with a poorly placed ("bad") decision threshold]
$$P(\mathrm{error} \mid x) = \begin{cases} P(C_1 \mid x) & \text{if } c(x) = C_2 \\ P(C_2 \mid x) & \text{if } c(x) = C_1 \end{cases}$$

$$P(\mathrm{error}) = \int P(\mathrm{error} \mid x)\, p(x)\, dx$$
Examples
• One measurement, symmetric cost, equal priors
[Figure: the same two densities with the optimal ("good") threshold at their crossing point]
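A minimal numerical sketch (not from the lecture; means, variance, and thresholds are assumed for illustration) of how the two thresholds compare:

```python
# Evaluate P(error) = ∫ P(error|x) p(x) dx for two 1D Gaussians with equal
# priors, comparing an arbitrary "bad" threshold against the optimal midpoint.
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0          # assumed class means and common std
x = np.linspace(-6.0, 8.0, 10_000)
dx = x[1] - x[0]
p1 = 0.5 * norm.pdf(x, mu1, sigma)       # p(x|C1) p(C1), equal priors
p2 = 0.5 * norm.pdf(x, mu2, sigma)       # p(x|C2) p(C2)

def p_error(threshold):
    # decide C1 left of the threshold, C2 right of it;
    # at each x the error is the probability mass of the losing class
    err = np.where(x < threshold, p2, p1)
    return err.sum() * dx

print("bad threshold (0.2):    ", p_error(0.2))
print("optimal midpoint (1.0): ", p_error((mu1 + mu2) / 2))
```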
How to Make the Best Decision? (Bayes Decision Theory)
• Define a cost function for mistakes, e.g.

$$L(C_i, C_j) = 1 - \delta_{ij}$$

• Minimize expected loss (risk) over the entire p(C,x):

$$R = E\big[L(C, c(x))\big] = E\Big[E\big[L(C, c(x)) \mid x\big]\Big] = \int E\big[L(C, c(x)) \mid x\big]\; p(x)\, dx$$

$$E\big[L(C, c(x)) \mid x\big] = \sum_{i=1}^{N} L(C_i, c(x))\; p(C_i \mid x)$$

• Sufficient to assure an optimal decision for each individual x.
• Result: decide according to the maximum posterior probability:

$$c(x) = \arg\max_i\; p(C_i \mid x)$$
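For a single x, the minimum-risk rule is a few lines of code; a sketch with made-up posteriors and the 0/1 loss (any loss matrix works the same way):

```python
# Minimum-risk decision for one x: pick the action j minimizing the
# conditional risk sum_i L(C_i, C_j) p(C_i | x).
import numpy as np

posteriors = np.array([0.7, 0.2, 0.1])   # p(C_i | x), invented for illustration
L = 1.0 - np.eye(3)                      # 0/1 loss: L(C_i, C_j) = 1 - delta_ij

conditional_risk = L.T @ posteriors      # risk of deciding each class
print(np.argmin(conditional_risk))       # with 0/1 loss: the argmax posterior (here 0)
```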
Two Classes, C1, C2
• It is helpful to consider the likelihood ratio:

$$\frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$$

• Use the known priors p(Ci) or ignore them.
• For a more elaborate loss function (the proof is easy), decide C1 if:

$$g(x) \equiv \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\geq\; \frac{l_{12} - l_{22}}{l_{21} - l_{11}} \cdot \frac{p(C_2)}{p(C_1)}$$

• g(x) is called a discriminant function
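A hedged sketch of this test (densities, losses, and priors are all assumed; l_ij is read here as the loss of deciding C_i when the true class is C_j):

```python
# Likelihood-ratio test with a general loss matrix for two 1D Gaussian classes.
from scipy.stats import norm

l11, l22 = 0.0, 0.0
l12, l21 = 5.0, 1.0                       # deciding C1 on a true C2 costs 5x more
p1, p2 = 0.5, 0.5                         # priors
mu1, mu2, sigma = 0.0, 2.0, 1.0           # class-conditional parameters

threshold = (l12 - l22) / (l21 - l11) * (p2 / p1)

def decide(x):
    g = norm.pdf(x, mu1, sigma) / norm.pdf(x, mu2, sigma)   # likelihood ratio g(x)
    return "C1" if g >= threshold else "C2"

print(decide(0.0), decide(1.0))   # boundary sits near x ≈ 0.2, not the midpoint 1.0
```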
Discriminant Functions for Multivariate Gaussian Class Conditional Densities
• Two multivariate Gaussians in d dimensions
• Since log is monotonic, we can look at log g(x):

$$\log g(x) = \log \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = g_1(x) - g_2(x)$$

$$g_i(x) = -\frac{1}{2}\underbrace{(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)}_{\text{Mahalanobis distance}^2} - \underbrace{\frac{d}{2}\log 2\pi}_{\text{superfluous}} - \frac{1}{2}\log|\Sigma_i| + \log p(C_i)$$
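A direct transcription of g_i(x) into Python (all parameters invented for illustration):

```python
# Gaussian log-discriminant g_i(x): Mahalanobis term + log-det + log prior.
import numpy as np

def g(x, mu, Sigma, prior):
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return (-0.5 * maha2
            - 0.5 * d * np.log(2 * np.pi)           # class-independent ("superfluous")
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([0.8, 0.5])
g1 = g(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = g(x, np.array([2.0, 1.0]), np.eye(2), 0.5)
print("decide C1" if g1 > g2 else "decide C2")
```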
Mahalanobis Distance
• iso-distance lines = iso-probability lines
• Decision surface:
$$d_1^2(x) - d_2^2(x) = \text{const.}, \qquad \text{where}\quad d_i^2(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$$

[Figure: iso-distance ellipses around μ1 and μ2 in the (x1, x2) plane, with the decision surface between them]
Case 1: $\Sigma_i = \sigma^2 I$
• Discriminant functions…

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log p(C_i)$$

• …simplify to (dropping class-independent terms):

$$g_1(x) - g_2(x) = -\frac{\|x - \mu_1\|^2}{2\sigma^2} + \frac{\|x - \mu_2\|^2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}$$

$$= -\frac{1}{2\sigma^2}\left(x^T x - 2\mu_1^T x + \mu_1^T \mu_1 - x^T x + 2\mu_2^T x - \mu_2^T \mu_2\right) + \log\frac{p(C_1)}{p(C_2)}$$

$$= \frac{(\mu_1 - \mu_2)^T x}{\sigma^2} - \frac{\mu_1^T \mu_1 - \mu_2^T \mu_2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}$$
Decision Boundary
Setting $g_1(x) - g_2(x) = 0$:

$$\frac{(\mu_1 - \mu_2)^T x}{\sigma^2} - \frac{\mu_1^T \mu_1 - \mu_2^T \mu_2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)} = 0$$

$$\Leftrightarrow\quad (\mu_1 - \mu_2)^T x = \frac{1}{2}\left(\|\mu_1\|^2 - \|\mu_2\|^2\right) - \sigma^2 \log\frac{p(C_1)}{p(C_2)}$$
• If μ2 = 0, we obtain

$$\mu_1^T x = \frac{1}{2}\,\mu_1^T \mu_1 - \sigma^2 \log\frac{p(C_1)}{p(C_2)}$$

The matched filter! With an expression for the threshold.
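A minimal sketch of the resulting detector (template, noise level, and priors are assumed): correlate the known template with the observation and compare against the derived threshold:

```python
# Matched filter: with Sigma_i = sigma^2 I and mu2 = 0, decide C1 (signal)
# when the correlation mu1^T x exceeds the threshold from the slide.
import numpy as np

rng = np.random.default_rng(0)
mu1 = np.array([1.0, 2.0, -1.0, 0.5])     # known signal template
sigma, p1, p2 = 1.0, 0.5, 0.5

threshold = 0.5 * mu1 @ mu1 - sigma**2 * np.log(p1 / p2)

x = mu1 + sigma * rng.standard_normal(mu1.shape)   # noisy observation of the signal
print("C1 (signal)" if mu1 @ x >= threshold else "C2 (noise)")
```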
Two Signals and Additive White Gaussian Noise
[Figure: signal templates μ1, μ2 and an observation x in the (x1, x2) plane, showing the vectors μ1 − μ2 and x − μ2]

$$(\mu_1 - \mu_2)^T (x - \mu_2) \;\gtrless\; \frac{1}{2}\,\|\mu_1 - \mu_2\|^2 - \sigma^2 \log\frac{p(C_1)}{p(C_2)}$$

The right-hand side is the detection threshold.
Case 2: Σi = Σ
• Two classes, 2D measurements, p(x|C) are multivariate Gaussians with equal covariance matrices.
• Derivation is similar:
– Quadratic term vanishes since it is independent of class
– We obtain a linear decision surface
• Matlab demo
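The Matlab demo itself isn't reproduced here; a rough Python stand-in (all parameters are synthetic assumptions) shows the linear rule that falls out:

```python
# Equal-covariance Gaussian classes yield a linear discriminant
# w^T x + w0 with w = Sigma^{-1} (mu1 - mu2).
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # shared covariance (assumed)
p1 = p2 = 0.5

w = np.linalg.solve(Sigma, mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ w + np.log(p1 / p2)

def decide(x):
    return "C1" if w @ x + w0 > 0 else "C2"

print(decide(mu1), decide(mu2))           # each mean falls on its own side
```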
Case 3: General Covariance Matrix
• See transparency: the quadratic terms no longer cancel, and the decision surfaces become quadrics (e.g. ellipses, parabolas, hyperbolas)
Isn’t this too simple?
• Not at all…
• It is true that images form complicated manifolds (from a pixel point of view, translation, rotation and scaling are all highly non-linear operations)
• The high dimensionality helps
Assume Unknown Class Densities
• In real life, we do not know the class conditional densities.
• But we do have example data.
• This puts us in the typical machine learning scenario: we want to learn a function, c(x), from examples.
• Why not just estimate class densities from examples and apply the previous ideas?
– To learn even a Gaussian (a simple density) in N dimensions, we need at least on the order of N² samples!
• 10×10 pixels → 10,000 examples!
– Avoid estimating densities whenever you can! (too general)
– The posterior is generally simpler than the class conditional (see transparency)
Remember PCA?
• Principal components are eigenvectors of the covariance matrix
• Use reconstruction error for recognition (e.g. Eigenfaces)
– good:
• reduces dimensionality
– bad:
• no model within subspace
• linearity may be inappropriate
• covariance not appropriate to optimize discrimination
$$C = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T = U S U^T, \qquad x \approx \mu + U \hat{z}$$

[Figure: data cloud with mean μ and principal direction u1 in the (x1, x2) plane, with a sample x and its projection]
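As a concrete illustration of recognition by reconstruction error (not the lecture's code; the data and dimensions are made up):

```python
# Project onto the top-k principal components; samples near the training
# subspace reconstruct well, outliers do not.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 10))  # rank-3 data in 10-D
mu = X.mean(axis=0)

C = np.cov(X.T)                           # covariance matrix
eigvals, U = np.linalg.eigh(C)            # eigh returns ascending eigenvalues
Uk = U[:, -3:]                            # top k = 3 principal components

def reconstruction_error(x):
    z = Uk.T @ (x - mu)                   # subspace coordinates
    x_hat = mu + Uk @ z                   # x ≈ mu + U z, as on the slide
    return np.linalg.norm(x - x_hat)

print(reconstruction_error(X[0]))                          # ~0: in the subspace
print(reconstruction_error(10 * rng.standard_normal(10)))  # typically much larger
```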
Fisher’s Linear Discriminant
• Goal: Reduce dimensionality before training classifiers etc. (Feature Selection)
• Similar goal as PCA!
• Fisher has classification in mind…
• Find projection directions such that separation is easiest
• Eigenfaces vs. Fisherfaces
Fisher’s Linear Discriminant
• Assume we have n d-dimensional samples x1,…,xn
• n1 from set (class) X1 and n2 from set X2
• we form linear combinations:
• and obtain y1,…,yn
• only direction of w is important
$$y = w^T x$$
Objective for Fisher
• Measure the separation as the distance between the means after projecting (k = 1, 2):

$$\tilde{m}_k = \frac{1}{n_k}\sum_{y \in Y_k} y = \frac{1}{n_k}\sum_{x \in X_k} w^T x = w^T m_k$$

• Measure the scatter after projecting:

$$\tilde{s}_k^2 = \sum_{y \in Y_k} \left(y - \tilde{m}_k\right)^2$$

• Objective becomes to maximize

$$J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
• We need to make the dependence on w explicit:

$$\tilde{s}_k^2 = \sum_{x \in X_k}\left(w^T x - w^T m_k\right)^2 = w^T\left[\sum_{x \in X_k}(x - m_k)(x - m_k)^T\right] w \equiv w^T S_k\, w$$

• Defining the within-class scatter matrix, S_W = S_1 + S_2, we obtain

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W\, w$$

• Similarly for the separation (between-class scatter matrix):

$$(\tilde{m}_1 - \tilde{m}_2)^2 = \left(w^T m_1 - w^T m_2\right)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B\, w$$

• Finally we can write

$$J(w) = \frac{w^T S_B\, w}{w^T S_W\, w}$$
Fisher’s Solution
• $J(w) = \dfrac{w^T S_B\, w}{w^T S_W\, w}$ is called a generalized Rayleigh quotient. Any w that maximizes J must satisfy the generalized eigenvalue problem

$$S_B\, w = \lambda\, S_W\, w$$

• Since S_B has rank 1 and S_B w is always in the direction of (m1 − m2), we are done:

$$w = S_W^{-1}(m_1 - m_2)$$
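The closed-form solution is a one-liner; a minimal sketch on synthetic two-class data (all parameters assumed):

```python
# Fisher's linear discriminant: w = S_W^{-1} (m1 - m2).
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.standard_normal((50, 2))                          # class 1 samples
X2 = rng.standard_normal((50, 2)) + np.array([3.0, 2.0])   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)     # within-class scatter

w = np.linalg.solve(Sw, m1 - m2)     # only the direction matters

y1, y2 = X1 @ w, X2 @ w              # 1-D projections y = w^T x
print("projected means:", y1.mean(), y2.mean())
```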
Comments on FLD
• We did not follow Bayes Decision Theory
• FLD is useful for many types of densities
• Fisher can be extended (see demo):
– more than one projection direction
– more than two clusters
• Let’s try it out: Matlab Demo
Fisher vs. Bayes
• Assume we do have identical Gaussian class densities; then Bayes says:

$$w^T x + w_0 = 0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2)$$

• while Fisher says:

$$w = S_W^{-1}(m_1 - m_2)$$

• Since S_W is proportional to the covariance matrix, w is in the same direction in both cases.
• Comforting...
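A quick numerical check of this claim (synthetic data; the covariance and means are assumed): the cosine between the two directions should be close to 1:

```python
# Compare the Bayes direction Sigma^{-1}(mu1 - mu2) with the Fisher
# direction S_W^{-1}(m1 - m2) estimated from samples.
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)                    # Sigma = L L^T
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])

X1 = rng.standard_normal((5000, 2)) @ L.T + mu1
X2 = rng.standard_normal((5000, 2)) @ L.T + mu2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w_bayes = np.linalg.solve(Sigma, mu1 - mu2)
w_fisher = np.linalg.solve(Sw, m1 - m2)

cos = w_bayes @ w_fisher / (np.linalg.norm(w_bayes) * np.linalg.norm(w_fisher))
print(cos)    # ≈ 1: same direction up to scale
```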
What have we achieved?
• Found out that maximum posterior strategy is optimal. Always.
• Looked at different cases of Gaussian class densities, where we could derive simple decision rules.
• Gaussian classifiers do reasonable jobs!
• Learned about FLD, which is useful and often preferable to PCA.
Just for Fun: Support Vector Machine
• Very fashionable… state of the art?
• Does not model densities
• Fits decision surface directly
• Maximizes margin → reduces “complexity”
• Decision surface only depends on nearby samples
• Matlab Demo
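The Matlab demo isn't reproduced here; a rough scikit-learn stand-in (assuming scikit-learn is available; data synthetic) shows that the fitted boundary depends only on the support vectors:

```python
# Fit a maximum-margin (linear) SVM and inspect its support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + np.array([4.0, 4.0])])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print(clf.predict([[0.0, 0.0], [4.0, 4.0]]))   # -> [0 1]
```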
Learning Algorithms
[Diagram: a learning algorithm receives examples (xi, yi) drawn from p(x, y), searches a set of functions, and outputs a learned function y = f(x)]
Assume Unknown Class Densities
• SVM examples
• Densities are hard to estimate → avoid it
– example from Ripley
• Give intuitions on overfitting
• Need to learn:
– standard machine learning problem
– training/test sets