Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim
Contents
4.1. Discriminant Functions
4.2. Probabilistic Generative Models
Classification Models

- Linear classification model
  - (D-1)-dimensional hyperplane for a D-dimensional input space
  - 1-of-K coding scheme for K > 2 classes, e.g. $\mathbf{t} = (0, 1, 0, 0, 0)^T$
- Discriminant function
  - Directly assigns each vector x to a specific class, e.g. Fisher's linear discriminant.
- Approaches using the conditional probability $p(C_k|\mathbf{x})$
  - Separation of the inference and decision stages
  - Two approaches:
    - Direct modeling of the posterior probability
    - Generative approach
      - Models the likelihood and prior probability to calculate the posterior probability
      - Capable of generating samples
Discriminant Functions - Two Classes

- Classification by hyperplanes:

$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0, \qquad \mathbf{x} \in \begin{cases} C_1 & \text{if } y(\mathbf{x}) \ge 0, \\ C_2 & \text{otherwise,} \end{cases}$$

or, compactly,

$$y(\mathbf{x}) = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}, \quad \text{where } \tilde{\mathbf{w}} = (w_0, \mathbf{w}) \text{ and } \tilde{\mathbf{x}} = (1, \mathbf{x}).$$
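As a concrete illustration of this rule, here is a minimal Python sketch (not from the slides); the weight vector and bias below are made-up values for a hyperplane in a 2-D input space.

```python
# Minimal sketch of the two-class rule y(x) = w^T x + w0 (illustrative values).
import numpy as np

def classify_two_class(x, w, w0):
    """Assign x to C1 if w^T x + w0 >= 0, otherwise to C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

# Hypothetical hyperplane x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
w0 = -1.0
print(classify_two_class(np.array([2.0, 0.5]), w, w0))  # C1
print(classify_two_class(np.array([0.1, 0.2]), w, w0))  # C2
```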
Discriminant Functions - Multiple Classes

- One-versus-the-rest classifier
  - K-1 classifiers for a K-class discriminant
  - Ambiguous when more than one classifier says 'yes'.
- One-versus-one classifier
  - K(K-1)/2 binary discriminant functions
  - Majority voting; ambiguity remains when votes are tied.

[Figures: ambiguous regions of the one-versus-the-rest and one-versus-one classifiers]
Discriminant Functions - Multiple Classes (Cont'd)

- K-class discriminant comprising K linear functions:

$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}, \quad k = 1, \ldots, K$$

- Assigns x to the class having the maximum output:

$$\mathbf{x} \in C_k \quad \text{if } y_k(\mathbf{x}) > y_j(\mathbf{x}) \text{ for all } j \neq k.$$

- The decision regions are always singly connected and convex:

For $\mathbf{x}_A, \mathbf{x}_B \in C_k$, let $\hat{\mathbf{x}} = \lambda\mathbf{x}_A + (1 - \lambda)\mathbf{x}_B$ with $0 \le \lambda \le 1$.
Then $y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1 - \lambda)y_k(\mathbf{x}_B)$ by linearity,
and $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ for all $j \neq k$,
therefore $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$ for all $j \neq k$.
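A short Python sketch of the argmax rule above, with made-up weights; an illustration, not code from the slides.

```python
# K-class linear discriminant: pick the class with the largest y_k(x).
import numpy as np

def classify_k_class(x, W, w0):
    """W: (K, D) matrix whose rows are w_k; w0: (K,) biases. Returns the winner."""
    return int(np.argmax(W @ x + w0))

# Illustrative K = 3 discriminant in 2-D.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.zeros(3)
print(classify_k_class(np.array([2.0, 0.5]), W, w0))  # 0, i.e. class C1
```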
Approaches for Learning Parameters for Linear Discriminant Functions

- Least squares method
- Fisher's linear discriminant
  - Relation to least squares
  - Multiple classes
- Perceptron algorithm
Least Squares Method

- Minimization of the sum-of-squares error (SSE)
- 1-of-K binary coding scheme for the target vector t
- For a training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1, \ldots, N$:

$$y(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}, \quad \text{where } \tilde{\mathbf{W}} = (\tilde{\mathbf{w}}_1 \ \ldots \ \tilde{\mathbf{w}}_K) \text{ and } \tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T.$$

The sum-of-squares error function is

$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2}\mathrm{Tr}\left\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\right\}, \quad \text{where } \tilde{\mathbf{X}} = (\tilde{\mathbf{x}}_1 \ \ldots \ \tilde{\mathbf{x}}_N)^T \text{ and } \mathbf{T} = (\mathbf{t}_1 \ \ldots \ \mathbf{t}_N)^T.$$

Minimizing the SSE gives

$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} = \tilde{\mathbf{X}}^\dagger\mathbf{T} \quad (\tilde{\mathbf{X}}^\dagger: \text{pseudo-inverse}).$$
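A sketch of this pseudo-inverse solution in NumPy, on hypothetical toy data; the helper names and the toy set are our assumptions, not from the slides.

```python
import numpy as np

def least_squares_fit(X, T):
    """W_tilde = pinv(X_tilde) @ T for a 1-of-K target matrix T of shape (N, K)."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])  # prepend bias column
    return np.linalg.pinv(X_tilde) @ T              # shape (D + 1, K)

def least_squares_predict(X, W_tilde):
    """Assign each row of X to the class with the largest output."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Hypothetical toy data: two Gaussian blobs with 1-of-2 coded targets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
T = np.vstack([np.tile([1, 0], (50, 1)), np.tile([0, 1], (50, 1))])
W_tilde = least_squares_fit(X, T)
accuracy = (least_squares_predict(X, W_tilde) == np.repeat([0, 1], 50)).mean()
print(accuracy)  # close to 1.0 on this nearly separable toy set
```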
Least Squares Method (Cont'd) - Limits and Disadvantages

- The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
- Vulnerable to outliers, because the SSE function penalizes 'too correct' examples, i.e. those lying far on the correct side of the decision boundary.
- Corresponds to ML under a Gaussian conditional distribution: unimodal vs. multimodal.
Least Squares Method (Cont'd) - Limits and Disadvantages

- The lack of robustness comes from the fact that the least squares method corresponds to maximum likelihood under the assumption of a Gaussian distribution.
- Binary target vectors are far from this assumption.

[Figures: decision boundaries of the least squares solution vs. logistic regression]
Fisher's Linear Discriminant

- Linear classification model as dimensionality reduction from the D-dimensional space to one dimension:

$$y = \mathbf{w}^T\mathbf{x}$$

- In the case of two classes: if $y \ge -w_0$, then $C_1$; otherwise $C_2$.
- Finding w such that the projected data are clustered well.
Fisher's Linear Discriminant (Cont'd)

- Maximizing the projected mean distance?
- The distance between the cluster means m_1 and m_2 projected onto w:

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in C_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in C_2}\mathbf{x}_n$$

$$m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$$

- Not appropriate when the covariances are nondiagonal.
Fisher's Linear Discriminant (Cont'd)

- Incorporate the within-class variance of the projected data: find w that maximizes J(w).

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}, \quad \text{where } s_k^2 = \sum_{n \in C_k}(y_n - m_k)^2,$$

which can be written as

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}},$$

where

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T \quad \text{(between-class covariance matrix)},$$

$$\mathbf{S}_W = \sum_{n \in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T \quad \text{(within-class covariance matrix)}.$$

- J(w) is maximized when

$$(\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w} = (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w},$$

and since $\mathbf{S}_B\mathbf{w}$ is always in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$,

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1) \quad \text{(Fisher's linear discriminant)}.$$

- If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
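A minimal NumPy sketch of this solution, $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, on illustrative synthetic data.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's discriminant direction w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids forming the inverse explicitly
    return w / np.linalg.norm(w)

# Illustrative correlated (nondiagonal-covariance) Gaussian classes.
rng = np.random.default_rng(1)
cov = [[2.0, 1.0], [1.0, 1.0]]
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([2, 2], cov, 100)
print(fisher_direction(X1, X2))
```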
Fisher's Linear Discriminant - Relation to Least Squares

- The Fisher criterion as a special case of least squares, when setting the target values as N/N_1 for class C_1 and -N/N_2 for class C_2:

$$E = \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right)^2$$

Setting the derivatives to zero:

$$\frac{dE}{dw_0} = 0: \quad \sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right) = 0 \quad (1)$$

$$\frac{dE}{d\mathbf{w}} = 0: \quad \sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right)\mathbf{x}_n = 0 \quad (2)$$

By solving (1):

$$w_0 = -\mathbf{w}^T\mathbf{m}, \quad \text{where } \mathbf{m} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n = \frac{1}{N}(N_1\mathbf{m}_1 + N_2\mathbf{m}_2).$$

By solving (2) with the above:

$$\left(\mathbf{S}_W + \frac{N_1 N_2}{N}\mathbf{S}_B\right)\mathbf{w} = N(\mathbf{m}_1 - \mathbf{m}_2),$$

and since $\mathbf{S}_B\mathbf{w}$ is always in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$,

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1).$$
Fisher's Discriminant for Multiple Classes

- K > 2 classes; dimensionality reduction from D to D'
- D' > 1 linear features, $y_k = \mathbf{w}_k^T\mathbf{x}$ (k = 1, ..., D'), i.e. $\mathbf{y} = \mathbf{W}^T\mathbf{x}$
- Generalization of S_W and S_B:

$$\mathbf{S}_W = \sum_{k=1}^{K}\mathbf{S}_k, \quad \text{where } \mathbf{S}_k = \sum_{n \in C_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T \text{ and } \mathbf{m}_k = \frac{1}{N_k}\sum_{n \in C_k}\mathbf{x}_n,$$

$$\mathbf{S}_B = \sum_{k=1}^{K}N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T.$$

- S_B arises from the decomposition of the total covariance matrix (Duda and Hart, 1997):

$$\mathbf{S}_T = \sum_{n=1}^{N}(\mathbf{x}_n - \mathbf{m})(\mathbf{x}_n - \mathbf{m})^T, \quad \text{where } \mathbf{m} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n = \frac{1}{N}\sum_{k=1}^{K}N_k\mathbf{m}_k,$$

$$\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B.$$
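A short NumPy sketch computing these generalized scatter matrices; the function name and data layout are our assumptions. The final line checks the decomposition S_T = S_W + S_B numerically.

```python
import numpy as np

def scatter_matrices(X, labels, K):
    """Within-class S_W and between-class S_B as defined above."""
    m = X.mean(axis=0)                     # global mean
    D = X.shape[1]
    S_W, S_B = np.zeros((D, D)), np.zeros((D, D))
    for k in range(K):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)     # S_k
        d = (mk - m)[:, None]
        S_B += len(Xk) * (d @ d.T)         # N_k (m_k - m)(m_k - m)^T
    return S_W, S_B

# Sanity check of S_T = S_W + S_B on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
labels = np.repeat([0, 1, 2], 30)
S_W, S_B = scatter_matrices(X, labels, 3)
S_T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
print(np.allclose(S_T, S_W + S_B))  # True
```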
Fisher's Discriminant for Multiple Classes (Cont'd)

- Covariance matrices in the projected y-space:

$$\mathbf{s}_W = \sum_{k=1}^{K}\sum_{n \in C_k}(\mathbf{y}_n - \boldsymbol{\mu}_k)(\mathbf{y}_n - \boldsymbol{\mu}_k)^T \quad \text{and} \quad \mathbf{s}_B = \sum_{k=1}^{K}N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T,$$

$$\text{where } \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n \in C_k}\mathbf{y}_n \quad \text{and} \quad \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{K}N_k\boldsymbol{\mu}_k.$$

- Fukunaga's criterion:

$$J(\mathbf{W}) = \mathrm{Tr}\left\{\mathbf{s}_W^{-1}\mathbf{s}_B\right\} = \mathrm{Tr}\left\{(\mathbf{W}^T\mathbf{S}_W\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{S}_B\mathbf{W})\right\}$$

- Another criterion (Duda et al., 'Pattern Classification', Ch. 3.8.3):

$$J(\mathbf{W}) = \frac{|\mathbf{s}_B|}{|\mathbf{s}_W|} = \frac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|}$$

  - The determinant is the product of the eigenvalues, i.e. the variances in the principal directions.
Perceptron Algorithm

- Classification of x by a perceptron:

$$y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x}), \quad \text{where } f(a) = \begin{cases} +1, & a \ge 0, \\ -1, & a < 0. \end{cases}$$

- Error functions
  - The total number of misclassified patterns: piecewise constant and discontinuous; the gradient is zero almost everywhere.
  - Perceptron criterion:

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}}\mathbf{w}^T\mathbf{x}_n t_n, \quad \text{where } t_n \in \{-1, +1\} \text{ is the target output and } \mathcal{M} \text{ is the set of misclassified patterns.}$$
Perceptron Algorithm (Cont'd)

- Stochastic gradient descent algorithm:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\mathbf{x}_n t_n$$

- The error from a misclassified pattern is reduced after each iteration:

$$-\mathbf{w}^{(\tau+1)T}\mathbf{x}_n t_n = -\mathbf{w}^{(\tau)T}\mathbf{x}_n t_n - (\mathbf{x}_n t_n)^T\mathbf{x}_n t_n < -\mathbf{w}^{(\tau)T}\mathbf{x}_n t_n$$

  - This does not imply that the overall error is reduced.
- Perceptron convergence theorem: if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it.
- Remaining issues, however: learning speed, linearly nonseparable data, multiple classes.
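A minimal Python sketch of this stochastic update loop on hypothetical separable toy data; the function name and stopping rule are our choices, not from the slides.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Stochastic updates w <- w + eta * x_n * t_n on misclassified patterns.

    X: (N, D) inputs with a bias column already included; t: targets in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # misclassified (zero score counted as error)
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                # every pattern correct: converged
            break
    return w

# Hypothetical linearly separable toy data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
X = np.hstack([np.ones((60, 1)), X])   # prepend bias feature
t = np.repeat([-1, 1], 30)
w = perceptron_train(X, t)
print(np.all(np.sign(X @ w) == t))     # True once converged
```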
Perceptron Algorithm (Cont'd)

[Figure: steps of perceptron learning, panels (a)-(d)]
Probabilistic Generative Models

- Computation of posterior probabilities using class-conditional densities and class priors, $p(\mathbf{x}|C_k)$ and $p(C_k)$.
- Two classes:

$$p(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)p(C_1)}{p(\mathbf{x}|C_1)p(C_1) + p(\mathbf{x}|C_2)p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a),$$

$$\text{where } a = \ln\frac{p(\mathbf{x}|C_1)p(C_1)}{p(\mathbf{x}|C_2)p(C_2)}.$$

- Generalization to K > 2 classes:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{\sum_j p(\mathbf{x}|C_j)p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad \text{where } a_k = \ln\left(p(\mathbf{x}|C_k)p(C_k)\right).$$

- The normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function.
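A small NumPy sketch of the two activation functions above; the stabilization trick of subtracting the maximum is standard practice, not from the slides.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Normalized exponential over a vector of activations a_k."""
    a = a - np.max(a)                  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Two classes: a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))].
print(sigmoid(0.0))                                # 0.5: equal evidence
# K classes: a_k = ln[p(x|C_k)p(C_k)] recovers the normalized probabilities.
print(softmax(np.log(np.array([0.2, 0.5, 0.3]))))  # [0.2 0.5 0.3]
```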
Probabilistic Generative Models - Continuous Inputs

- Posterior probabilities when the class-conditional densities are Gaussian, sharing the same covariance matrix Σ:

$$p(\mathbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right\}$$

- Two classes:

$$p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0),$$

$$\text{where } \mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \quad \text{and} \quad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}.$$

- The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space; the priors only shift the decision boundary, i.e. the contours stay parallel.

[Figures: class-conditional densities $p(\mathbf{x}|C_k)$ and the posterior $p(C_1|\mathbf{x})$]
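A small NumPy sketch computing w and w0 from the formulas above; the means, covariance, and priors are illustrative values.

```python
import numpy as np

def gaussian_posterior_params(mu1, mu2, Sigma, p1, p2):
    """w and w0 such that p(C1|x) = sigma(w^T x + w0) under a shared covariance."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)
    return w, w0

# Illustrative parameters; symmetric means and equal priors give w0 = 0.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = gaussian_posterior_params(mu1, mu2, Sigma, 0.5, 0.5)
print(w, w0)
```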
Probabilistic Generative Models - Continuous Inputs (Cont'd)

- Generalization to K classes:

$$a_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}, \quad \text{where } \mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k \text{ and } w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(C_k).$$

- When sharing the same covariance matrix, the decision boundaries are linear again.
- If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.
Probabilistic Generative Models - Maximum Likelihood Solution

- Determining the parameters of $p(\mathbf{x}|C_k)$ and $p(C_k)$ using maximum likelihood from a training data set.
- Two classes:
  - Data set: $\{\mathbf{x}_n, t_n\}$, $n = 1, \ldots, N$, with $t_n = 1$ denoting $C_1$ and $t_n = 0$ denoting $C_2$.
  - Priors: $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$.

$$p(\mathbf{x}_n, C_1) = p(C_1)p(\mathbf{x}_n|C_1) = \pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}_n, C_2) = p(C_2)p(\mathbf{x}_n|C_2) = (1 - \pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

- The likelihood function:

$$p(\mathbf{t}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}\left[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{t_n}\left[(1 - \pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1 - t_n}, \quad \text{where } \mathbf{t} = (t_1, \ldots, t_N)^T.$$
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

- Two classes (cont'd)
- Maximization of the likelihood with respect to π:
  - Terms of the log likelihood that depend on π:

$$\sum_{n=1}^{N}\left\{t_n\ln\pi + (1 - t_n)\ln(1 - \pi)\right\}$$

  - Setting the derivative with respect to π equal to zero:

$$\pi = \frac{1}{N}\sum_{n=1}^{N}t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$

- Maximization with respect to μ_1:

$$\sum_{n=1}^{N}t_n\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2}\sum_{n=1}^{N}t_n(\mathbf{x}_n - \boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const}$$

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N}t_n\mathbf{x}_n, \quad \text{and analogously} \quad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\mathbf{x}_n.$$
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

- Two classes (cont'd)
- Maximization of the likelihood with respect to the shared covariance matrix Σ:

$$-\frac{1}{2}\sum_{n=1}^{N}t_n\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}t_n(\mathbf{x}_n - \boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_1)$$
$$-\frac{1}{2}\sum_{n=1}^{N}(1 - t_n)\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}(1 - t_n)(\mathbf{x}_n - \boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_2)$$
$$= -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{N}{2}\mathrm{Tr}\left\{\boldsymbol{\Sigma}^{-1}\mathbf{S}\right\},$$

which is maximized by

$$\boldsymbol{\Sigma} = \mathbf{S} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2, \quad \text{where } \mathbf{S}_k = \frac{1}{N_k}\sum_{n \in C_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T.$$

- A weighted average of the covariance matrices associated with each class.
- But not robust to outliers.
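A compact NumPy sketch of these ML estimates (π, μ_1, μ_2, and the shared Σ); the function name and data layout are our assumptions.

```python
import numpy as np

def ml_estimates(X, t):
    """ML estimates of pi, mu1, mu2 and the shared Sigma (t_n = 1 for C1, 0 for C2)."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2   # weighted average of class covariances
    return pi, mu1, mu2, Sigma

# Hypothetical labelled sample.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (60, 2))])
t = np.repeat([1, 0], [40, 60])
print(ml_estimates(X, t))
```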
Probabilistic Generative Models - Discrete Features

- Discrete feature values $x_i \in \{0, 1\}$
  - A general distribution would correspond to a table of size $2^D$: with D inputs, the table size grows exponentially with the number of features.
- Naive Bayes assumption, conditioned on the class C_k:

$$p(\mathbf{x}|C_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1 - \mu_{ki})^{1 - x_i}$$

- Linear with respect to the features, as in the continuous case:

$$a_k(\mathbf{x}) = \ln\left(p(\mathbf{x}|C_k)p(C_k)\right) = \sum_{i=1}^{D}\left\{x_i\ln\mu_{ki} + (1 - x_i)\ln(1 - \mu_{ki})\right\} + \ln p(C_k)$$
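A minimal sketch of the naive Bayes log posterior a_k(x) above, with illustrative Bernoulli parameters; the function name and the numbers are ours.

```python
import numpy as np

def naive_bayes_log_posterior(x, mu, log_prior):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k).

    x: (D,) binary features; mu: (K, D) Bernoulli parameters per class.
    """
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior

mu = np.array([[0.9, 0.2], [0.3, 0.8]])   # illustrative parameters for K = 2
log_prior = np.log([0.5, 0.5])
x = np.array([1, 0])
a = naive_bayes_log_posterior(x, mu, log_prior)
print(np.argmax(a))  # 0: class C1 explains x = (1, 0) better
```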
Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42)

[Figure: examples of two-dimensional Bayes decision boundaries]
Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43)

[Figure: examples of three-dimensional Bayes decision boundaries]
Probabilistic Generative Models - Exponential Family

- For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.
- Generalization to class-conditional densities of the exponential family:

$$p(\mathbf{x}|\boldsymbol{\lambda}_k) = h(\mathbf{x})\,g(\boldsymbol{\lambda}_k)\exp\left\{\boldsymbol{\lambda}_k^T\mathbf{u}(\mathbf{x})\right\}$$

- For the subclass for which $\mathbf{u}(\mathbf{x}) = \mathbf{x}$, with some scaling parameter s:

$$p(\mathbf{x}|\boldsymbol{\lambda}_k, s) = \frac{1}{s}\,h\!\left(\frac{1}{s}\mathbf{x}\right)g(\boldsymbol{\lambda}_k)\exp\left\{\frac{1}{s}\boldsymbol{\lambda}_k^T\mathbf{x}\right\}$$

- Linear with respect to x again:
  - Two classes: $p(C_1|\mathbf{x}) = \sigma(a(\mathbf{x}))$, with

$$a(\mathbf{x}) = (\boldsymbol{\lambda}_1 - \boldsymbol{\lambda}_2)^T\mathbf{x} + \ln g(\boldsymbol{\lambda}_1) - \ln g(\boldsymbol{\lambda}_2) + \ln p(C_1) - \ln p(C_2)$$

  - K classes: $p(C_k|\mathbf{x}) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$, with

$$a_k(\mathbf{x}) = \boldsymbol{\lambda}_k^T\mathbf{x} + \ln g(\boldsymbol{\lambda}_k) + \ln p(C_k)$$