Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim
(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Contents: 4.1. Discriminant Functions; 4.2. Probabilistic Generative Models.

Transcript of Ch 4. Linear Models for Classification (1/2)

Page 1: Ch 4. Linear Models for Classification (1/2)

Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Summarized and revised by Hee-Woong Lim

Page 2: Contents

4.1. Discriminant Functions

4.2. Probabilistic Generative Models

Page 3: Classification Models

Linear classification model: the decision surface is a (D-1)-dimensional hyperplane in the D-dimensional input space. A 1-of-K coding scheme is used for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T.

Discriminant function: directly assigns each vector x to a specific class, e.g. Fisher's linear discriminant.

Approaches using conditional probability: separation of the inference and decision stages. Two approaches:

Direct modeling of the posterior probability p(C_k|x).

Generative approach:

– Modeling the likelihood p(x|C_k) and the prior probability p(C_k) to calculate the posterior probability p(C_k|x).

– Capable of generating samples.

Page 4: Discriminant Functions - Two Classes

Classification by hyperplanes

y(x) = w^T x + w_0; assign x to C_1 if y(x) >= 0, otherwise to C_2,

or

y(x) = \tilde{w}^T \tilde{x}, where \tilde{w} = (w_0, w) and \tilde{x} = (1, x).

Page 5: Discriminant Functions - Multiple Classes

One-versus-the-rest classifier: K-1 classifiers for a K-class discriminant. Ambiguous when more than one classifier says 'yes'.

One-versus-one classifier: K(K-1)/2 binary discriminant functions. Majority voting; ambiguity remains when votes are tied.

[Figure: ambiguous regions of the one-versus-the-rest and one-versus-one constructions]
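As a rough illustration of these two constructions, the sketch below (in Python, with hypothetical weight arrays and helper names, not taken from the slides) counts the decisions of a set of binary linear discriminants and reports ambiguity:

import numpy as np

def one_vs_rest_predict(x, W, b):
    # W: (K, D) weights, b: (K,) biases; one binary discriminant per class.
    # Returns the predicted class index, or None when the region is ambiguous
    # (no discriminant, or more than one discriminant, says 'yes').
    scores = W @ x + b
    positive = np.flatnonzero(scores >= 0)
    return int(positive[0]) if positive.size == 1 else None

def one_vs_one_predict(x, pair_w, pair_b, K):
    # pair_w/pair_b: dicts keyed by (j, k), j < k, holding the weights/bias of the
    # discriminant separating C_j (positive side) from C_k (negative side).
    votes = np.zeros(K, dtype=int)
    for (j, k), w in pair_w.items():
        votes[j if w @ x + pair_b[(j, k)] >= 0 else k] += 1
    winners = np.flatnonzero(votes == votes.max())
    return int(winners[0]) if winners.size == 1 else None  # None when votes tie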

Page 6: Discriminant Functions - Multiple Classes (Cont'd)

K-class discriminant comprising K linear functions

Assigns x to the corresponding class having the maximum output.

The decision regions are always singly connected and convex.

y_k(x) = w_k^T x + w_{k0}, for k = 1, ..., K.

Assign x to C_k if y_k(x) > y_j(x) for all j != k.

Convexity: for x_A, x_B in region R_k, let \hat{x} = \lambda x_A + (1 - \lambda) x_B with 0 <= \lambda <= 1.
Then y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B), and since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for j != k,
therefore y_k(\hat{x}) > y_j(\hat{x}) for j != k, i.e. \hat{x} also lies in R_k.
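A minimal sketch of this argmax decision rule, assuming the K weight vectors are stacked into a matrix (array names are illustrative):

import numpy as np

def k_class_predict(X, W, w0):
    # X: (N, D) inputs, W: (K, D) stacked weight vectors w_k, w0: (K,) biases w_k0.
    # Each row of X is assigned to the class with the largest linear output y_k(x).
    Y = X @ W.T + w0
    return np.argmax(Y, axis=1)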

Page 7: Approaches for Learning Parameters for Linear Discriminant Functions

Least square method

Fisher's linear discriminant
– Relation to least squares
– Multiple classes

Perceptron algorithm

Page 8: Least Square Method

Minimization of the sum-of-squares error (SSE), with a 1-of-K binary coding scheme for the target vector t.

y(x) = \tilde{W}^T \tilde{x}, where \tilde{W} = [\tilde{w}_1 ... \tilde{w}_K] and \tilde{w}_k = (w_{k0}, w_k^T)^T.

For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is

E_D(\tilde{W}) = (1/2) Tr{ (\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T) },

where the n-th rows of \tilde{X} and T are \tilde{x}_n^T and t_n^T, respectively.

Minimizing the SSE gives

\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T,  where \tilde{X}^\dagger is the pseudo-inverse.
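A minimal numpy sketch of this closed-form solution, assuming 1-of-K target rows in T and a prepended bias feature; np.linalg.lstsq is used here in place of explicitly forming the pseudo-inverse (a numerical-stability choice, not something the slides prescribe):

import numpy as np

def least_squares_fit(X, T):
    # X: (N, D) inputs, T: (N, K) 1-of-K target vectors.
    # Returns the augmented weight matrix W_tilde of shape (D+1, K).
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the bias feature
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None) # minimum-norm least-squares solution
    return W_tilde

def least_squares_predict(X, W_tilde):
    # Assign each input to the class whose linear output is largest.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)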

Page 9: Least Square Method (Cont'd) - Limit and Disadvantage

The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure the outputs to lie in the range [0, 1].

Vulnerable to outliers, because the SSE function also penalizes 'too correct' examples, i.e. points far from the decision boundary on the correct side.

Least squares is ML under a Gaussian conditional distribution: unimodal, whereas the binary targets are multimodal.

Page 10: Least Square Method (Cont'd) - Limit and Disadvantage

Lack of robustness comes from the fact that the least square method corresponds to maximum likelihood under the assumption of a Gaussian distribution, while binary target vectors are far from this assumption.

[Figure: least-squares solution vs. logistic regression decision boundaries]

Page 11: Fisher's Linear Discriminant

Linear classification model viewed as dimensionality reduction from the D-dimensional input space to one dimension. In the case of two classes:

Finding w such that the projected data are clustered well.

y = w^T x; if y >= -w_0, then assign x to C_1; otherwise to C_2.

Page 12: Fisher's Linear Discriminant (Cont'd)

Maximizing the projected mean distance? The distance between the cluster means, m_1 and m_2, projected onto w.

Not appropriate when the covariances are nondiagonal.

m_1 = (1/N_1) \sum_{n \in C_1} x_n   and   m_2 = (1/N_2) \sum_{n \in C_2} x_n

m_2' - m_1' = w^T (m_2 - m_1),   where m_k' = w^T m_k is the projected mean of class k.

Page 13: Fisher's Linear Discriminant (Cont'd)

Incorporate the within-class variance of the projected data, and find the w that maximizes J(w):

J(w) = (m_2' - m_1')^2 / (s_1^2 + s_2^2),   where s_k^2 = \sum_{n \in C_k} (y_n - m_k')^2,

or equivalently

J(w) = (w^T S_B w) / (w^T S_W w),

with

S_B = (m_2 - m_1)(m_2 - m_1)^T   (between-class covariance matrix)
S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T   (within-class covariance matrix)

J(w) is maximized when (w^T S_B w) S_W w = (w^T S_W w) S_B w, and S_B w is always in the direction of (m_2 - m_1), so

w \propto S_W^{-1} (m_2 - m_1)   (Fisher's linear discriminant).

If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
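A minimal sketch of the two-class Fisher direction w \propto S_W^{-1}(m_2 - m_1); the projected-midpoint threshold used for classification below is an illustrative assumption, not part of the slides:

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: (N1, D) and (N2, D) samples of classes C_1 and C_2.
    # Returns w proportional to S_W^{-1}(m_2 - m_1).
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class covariance
    return np.linalg.solve(S_W, m2 - m1)

def fisher_classify(x, w, m1, m2):
    # Illustrative threshold: midpoint of the two projected class means.
    threshold = 0.5 * (w @ m1 + w @ m2)
    return 2 if w @ x >= threshold else 1   # C_2 lies on the m_2 side of the projection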

Page 14: Fisher's Linear Discriminant - Relation to Least Squares

Fisher criterion as a special case of least squares.

When setting the target values as N/N_1 for class C_1 and -N/N_2 for class C_2, the sum-of-squares error is

E = (1/2) \sum_{n=1}^N (w^T x_n + w_0 - t_n)^2.

Setting the derivatives to zero:

dE/dw_0 = 0:  \sum_{n=1}^N (w^T x_n + w_0 - t_n) = 0    (1)
dE/dw = 0:   \sum_{n=1}^N (w^T x_n + w_0 - t_n) x_n = 0  (2)

By solving (1): w_0 = -w^T m, where m = (1/N) \sum_n x_n = (1/N)(N_1 m_1 + N_2 m_2).

By solving (2) with the above: (S_W + (N_1 N_2 / N) S_B) w = N(m_1 - m_2).

Since S_B w is always in the direction of (m_2 - m_1), we obtain w \propto S_W^{-1}(m_2 - m_1).

Page 15: Fisher's Discriminant for Multiple Classes

K > 2 classes: dimension reduction from D to D'.

D' > 1 linear features, y_k = w_k^T x (k = 1, ..., D'), i.e. y = W^T x.

Generalization of S_W and S_B:

S_W = \sum_{k=1}^K S_k,   S_k = \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^T,   m_k = (1/N_k) \sum_{n \in C_k} x_n

S_T = \sum_{n=1}^N (x_n - m)(x_n - m)^T,   m = (1/N) \sum_{n=1}^N x_n = (1/N) \sum_{k=1}^K N_k m_k

S_T = S_W + S_B,   so   S_B = \sum_{k=1}^K N_k (m_k - m)(m_k - m)^T.

S_B comes from the decomposition of the total covariance matrix (Duda and Hart, 1997).

Page 16: Fisher's Discriminant for Multiple Classes (Cont'd)

Covariance matrices in the projected y-space:

s_W = \sum_{k=1}^K \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T   and   s_B = \sum_{k=1}^K N_k (\mu_k - \mu)(\mu_k - \mu)^T,

where \mu_k = (1/N_k) \sum_{n \in C_k} y_n and \mu = (1/N) \sum_{k=1}^K N_k \mu_k.

Fukunaga's criterion:

J(W) = Tr{ s_W^{-1} s_B } = Tr{ (W S_W W^T)^{-1} (W S_B W^T) }

Another criterion (Duda et al., 'Pattern Classification', Ch. 3.8.3):

J(W) = |s_B| / |s_W| = |W S_B W^T| / |W S_W W^T|

Determinant: the product of the eigenvalues, i.e. the variances in the principal directions.
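One standard way to maximize such criteria is to take the leading eigenvectors of S_W^{-1} S_B as the projection directions; this is a well-known result rather than something worked out on the slide, and the names below are illustrative:

import numpy as np

def fisher_multiclass_projection(X, labels, d_prime):
    # X: (N, D) data, labels: (N,) class indices 0..K-1.
    # Returns W of shape (d_prime, D) whose rows are leading eigenvectors of S_W^{-1} S_B.
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)
        S_B += len(X_k) * np.outer(m_k - m, m_k - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)            # largest eigenvalues first
    return eigvecs[:, order[:d_prime]].real.T    # project with y = W @ x

Since S_B has rank at most K-1, at most K-1 of these directions carry discriminative information.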

Page 17: Fisher's Discriminant for Multiple Classes (Cont'd)

Page 18: Perceptron Algorithm

Classification of x by a perceptron:

y(x) = f(w^T x), where f(a) = +1 for a >= 0 and f(a) = -1 for a < 0.

Error functions: the total number of misclassified patterns is piecewise constant and discontinuous; its gradient is zero almost everywhere.

Perceptron criterion:

E_P(w) = -\sum_{n \in M} w^T x_n t_n, where t_n in {-1, +1} is the target output and M is the set of misclassified patterns.

Page 19: Perceptron Algorithm (Cont'd)

Stochastic gradient descent algorithm:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta x_n t_n

The error from a misclassified pattern is reduced after each iteration:

-w^{(\tau+1)T} x_n t_n = -w^{(\tau)T} x_n t_n - (x_n t_n)^T (x_n t_n) < -w^{(\tau)T} x_n t_n.

This does not imply that the overall error is reduced.

Perceptron convergence theorem: if there exists an exact solution (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it. However, issues remain: learning speed, linearly nonseparable data, and extension to multiple classes.
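A minimal sketch of this stochastic update, assuming targets coded as -1/+1 and a fixed learning rate eta (the epoch loop and names are illustrative):

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    # X: (N, D) inputs, t: (N,) targets in {-1, +1}.
    # Returns the augmented weight vector (bias first).
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # absorb the bias into w
    w = np.zeros(X_tilde.shape[1])
    for _ in range(max_epochs):
        n_errors = 0
        for x_n, t_n in zip(X_tilde, t):
            if t_n * (w @ x_n) <= 0:        # pattern is misclassified
                w += eta * x_n * t_n        # w <- w + eta * x_n * t_n
                n_errors += 1
        if n_errors == 0:                   # all patterns correct: stop
            break
    return w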

Page 20: Perceptron Algorithm (Cont'd)

[Figure: panels (a), (b), (c), (d)]

Page 21: Probabilistic Generative Models

Computation of posterior probabilities using class-conditional densities and class priors.

Two classes:

p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ) = 1 / (1 + exp(-a)) = \sigma(a),

where a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ].

Generalization to K > 2 classes:

p(C_k|x) = p(x|C_k) p(C_k) / \sum_j p(x|C_j) p(C_j) = exp(a_k) / \sum_j exp(a_j),

where a_k = ln p(x|C_k) p(C_k).

The normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function.
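A minimal sketch of these two relations, computing posteriors from the quantities a_k = ln p(x|C_k) p(C_k); the max-subtraction is a standard numerical-stability trick, not part of the slides:

import numpy as np

def posterior_two_class(log_joint_1, log_joint_2):
    # log_joint_k = ln p(x|C_k) + ln p(C_k); returns p(C_1|x) via the logistic sigmoid.
    a = log_joint_1 - log_joint_2
    return 1.0 / (1.0 + np.exp(-a))

def posterior_softmax(log_joints):
    # log_joints: array of a_k for k = 1..K; returns p(C_k|x) via the softmax.
    a = np.asarray(log_joints, dtype=float)
    a = a - a.max()            # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()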

Page 22: Probabilistic Generative Models - Continuous Inputs

Posterior probabilities when the class-conditional densities are Gaussian and share the same covariance matrix \Sigma:

p(x|C_k) = (1 / ((2\pi)^{D/2} |\Sigma|^{1/2})) exp{ -(1/2) (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) }.

Two classes:

p(C_1|x) = \sigma(w^T x + w_0),

where w = \Sigma^{-1}(\mu_1 - \mu_2) and w_0 = -(1/2) \mu_1^T \Sigma^{-1} \mu_1 + (1/2) \mu_2^T \Sigma^{-1} \mu_2 + ln( p(C_1) / p(C_2) ).

The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space. The priors enter only through w_0, i.e. they shift the decision boundary in parallel.
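A minimal sketch of these formulas for the two-class case (parameter names are illustrative):

import numpy as np

def gaussian_shared_cov_two_class(mu1, mu2, Sigma, prior1, prior2):
    # Returns (w, w0) such that p(C_1|x) = sigma(w^T x + w_0) when both
    # class-conditionals are Gaussian with the shared covariance matrix Sigma.
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

def posterior_c1(x, w, w0):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # logistic sigmoid of the linear form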

Page 23: Probabilistic Generative Models - Continuous Inputs (Cont'd)

Generalization to K classes

When sharing the same covariance matrix, the decision boundaries are linear again:

a_k(x) = w_k^T x + w_{k0},   where w_k = \Sigma^{-1} \mu_k and w_{k0} = -(1/2) \mu_k^T \Sigma^{-1} \mu_k + ln p(C_k).

If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.

Page 24: Probabilistic Generative Models - Maximum Likelihood Solution

Determining the parameters of p(x|C_k) and p(C_k) using maximum likelihood from a training data set.

Two classes:

Data set {x_n, t_n}, n = 1, ..., N, with t_n = 1 denoting C_1 and t_n = 0 denoting C_2, and t = (t_1, ..., t_N)^T.
Priors: p(C_1) = \pi and p(C_2) = 1 - \pi.

p(x_n, C_1) = p(C_1) p(x_n|C_1) = \pi N(x_n|\mu_1, \Sigma)
p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 - \pi) N(x_n|\mu_2, \Sigma)

The likelihood function:

p(t|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N [ \pi N(x_n|\mu_1, \Sigma) ]^{t_n} [ (1 - \pi) N(x_n|\mu_2, \Sigma) ]^{1 - t_n}

Page 25: Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

Two classes (cont'd)

Maximization of the likelihood with respect to \pi: the terms of the log likelihood that depend on \pi are

\sum_{n=1}^N { t_n ln \pi + (1 - t_n) ln(1 - \pi) }.

Setting the derivative with respect to \pi equal to zero gives

\pi = (1/N) \sum_{n=1}^N t_n = N_1 / N = N_1 / (N_1 + N_2).

Maximization with respect to \mu_1: the relevant terms are

\sum_{n=1}^N t_n ln N(x_n|\mu_1, \Sigma) = -(1/2) \sum_{n=1}^N t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + const,

which gives

\mu_1 = (1/N_1) \sum_{n=1}^N t_n x_n,   and analogously   \mu_2 = (1/N_2) \sum_{n=1}^N (1 - t_n) x_n.

Page 26: Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

Two classes (cont'd)

Maximization of the likelihood with respect to the shared covariance matrix \Sigma: the relevant terms of the log likelihood are

-(1/2) \sum_n t_n ln|\Sigma| - (1/2) \sum_n t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1)
-(1/2) \sum_n (1 - t_n) ln|\Sigma| - (1/2) \sum_n (1 - t_n) (x_n - \mu_2)^T \Sigma^{-1} (x_n - \mu_2)
= -(N/2) ln|\Sigma| - (N/2) Tr{ \Sigma^{-1} S },

where

S = (N_1/N) S_1 + (N_2/N) S_2,   S_k = (1/N_k) \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.

Maximizing gives \Sigma = S: a weighted average of the covariance matrices associated with each of the two classes.

But the result is not robust to outliers.
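A minimal numpy sketch of these closed-form ML estimates (pi, mu_1, mu_2, and the shared covariance S); array names are illustrative:

import numpy as np

def fit_gaussian_generative(X, t):
    # X: (N, D) inputs, t: (N,) binary targets (1 for C_1, 0 for C_2).
    # Returns the ML estimates pi, mu1, mu2 and the shared covariance S.
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    X1, X2 = X[t == 1], X[t == 0]
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    S = (N1 / N) * S1 + (N2 / N) * S2    # weighted average of the per-class covariances
    return pi, mu1, mu2, S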

Page 27: Probabilistic Generative Models - Discrete Features

Discrete feature values x_i in {0, 1}: a general distribution would correspond to a table of 2^D entries.

When we have D inputs, the table size grows exponentially with the number of features.

Naïve Bayes assumption, conditioned on the class C_k:

p(x|C_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

a_k(x) = ln p(x|C_k) p(C_k) = \sum_{i=1}^D { x_i ln \mu_{ki} + (1 - x_i) ln(1 - \mu_{ki}) } + ln p(C_k)

Linear with respect to the features, as in the continuous case.
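A minimal sketch of the resulting linear function a_k(x) for binary features under the naive Bayes assumption; the clipping of mu is a guard against log(0) and is an implementation choice, not part of the slides:

import numpy as np

def naive_bayes_log_joint(x, mu, log_prior):
    # x: (D,) binary feature vector, mu: (K, D) Bernoulli parameters mu_ki,
    # log_prior: (K,) values ln p(C_k).  Returns a_k(x) for every class k.
    mu = np.clip(mu, 1e-12, 1 - 1e-12)    # avoid log(0)
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior

def naive_bayes_predict(x, mu, log_prior):
    # Class prediction: argmax_k a_k(x); posteriors follow from the softmax of a_k(x).
    return int(np.argmax(naive_bayes_log_joint(x, mu, log_prior)))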

Page 28: Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42)

Page 29: Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43)

Page 30: Probabilistic Generative Models - Exponential Family

For both Gaussian distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

This generalizes to class-conditional densities of the exponential family, restricted to the subclass for which u(x) = x.

The activations are again linear with respect to x.

Exponential family:

p(x|\lambda_k) = h(x) g(\lambda_k) exp{ \lambda_k^T u(x) }

For the subclass u(x) = x, and introducing a scaling parameter s:

p(x|\lambda_k, s) = (1/s) h(x/s) g(\lambda_k) exp{ (1/s) \lambda_k^T x }

Two classes:

p(C_1|x) = \sigma(a(x)),   where a(x) = (\lambda_1 - \lambda_2)^T x + ln g(\lambda_1) - ln g(\lambda_2) + ln p(C_1) - ln p(C_2).

K classes:

p(C_k|x) = exp(a_k) / \sum_j exp(a_j),   where a_k(x) = \lambda_k^T x + ln g(\lambda_k) + ln p(C_k).