Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim
(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Contents: 4.1. Discriminant Functions; 4.2. Probabilistic Generative Models.

Transcript of Ch 4. Linear Models for Classification (1/2)

Page 1: Ch 4. Linear Models for Classification (1/2)

Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Summarized and revised by Hee-Woong Lim

Page 2: Contents

4.1. Discriminant Functions

4.2. Probabilistic Generative Models

Page 3: Classification Models

Linear classification model: the decision surface is a (D-1)-dimensional hyperplane in the D-dimensional input space. A 1-of-K coding scheme is used for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T.

Discriminant function: directly assigns each vector x to a specific class, e.g. Fisher's linear discriminant.

Approaches using conditional probability: separation of the inference and decision stages. Two approaches:

Direct modeling of the posterior probability p(C_k|x).

Generative approach:

– Modeling the likelihood p(x|C_k) and the prior probability p(C_k) to calculate the posterior probability p(C_k|x).

– Capable of generating samples.

Page 4: Discriminant Functions - Two Classes

Classification by hyperplanes

y(x) = w^T x + w_0; assign x to C_1 if y(x) >= 0, otherwise to C_2,

or

y(x) = \tilde{w}^T \tilde{x}, where \tilde{w} = (w_0, w) and \tilde{x} = (1, x).

Page 5: Discriminant Functions - Multiple Classes

One-versus-the-rest classifier: K-1 classifiers for a K-class discriminant. Ambiguous when more than one classifier says 'yes'.

One-versus-one classifier: K(K-1)/2 binary discriminant functions. Majority voting; ambiguity remains when votes are tied.

[Figure: ambiguous regions of the one-versus-the-rest and one-versus-one constructions]
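As a rough illustration of these two constructions, the sketch below (in Python, with hypothetical weight arrays and helper names, not taken from the slides) counts the decisions of a set of binary linear discriminants and reports ambiguity:

import numpy as np

def one_vs_rest_predict(x, W, b):
    # W: (K, D) weights, b: (K,) biases; one binary discriminant per class.
    # Returns the predicted class index, or None when the region is ambiguous
    # (no discriminant, or more than one discriminant, says 'yes').
    scores = W @ x + b
    positive = np.flatnonzero(scores >= 0)
    return int(positive[0]) if positive.size == 1 else None

def one_vs_one_predict(x, pair_w, pair_b, K):
    # pair_w/pair_b: dicts keyed by (j, k), j < k, holding the weights/bias of the
    # discriminant separating C_j (positive side) from C_k (negative side).
    votes = np.zeros(K, dtype=int)
    for (j, k), w in pair_w.items():
        votes[j if w @ x + pair_b[(j, k)] >= 0 else k] += 1
    winners = np.flatnonzero(votes == votes.max())
    return int(winners[0]) if winners.size == 1 else None  # None when votes tie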

Page 6: Discriminant Functions - Multiple Classes (Cont'd)

K-class discriminant comprising K linear functions

Assigns x to the corresponding class having the maximum output.

The decision regions are always singly connected and convex.

y_k(x) = w_k^T x + w_{k0}, for k = 1, ..., K.

Assign x to C_k if y_k(x) > y_j(x) for all j != k.

Convexity: for x_A, x_B in region R_k, let \hat{x} = \lambda x_A + (1 - \lambda) x_B with 0 <= \lambda <= 1.
Then y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B), and since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for j != k,
therefore y_k(\hat{x}) > y_j(\hat{x}) for j != k, i.e. \hat{x} also lies in R_k.
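A minimal sketch of this argmax decision rule, assuming the K weight vectors are stacked into a matrix (array names are illustrative):

import numpy as np

def k_class_predict(X, W, w0):
    # X: (N, D) inputs, W: (K, D) stacked weight vectors w_k, w0: (K,) biases w_k0.
    # Each row of X is assigned to the class with the largest linear output y_k(x).
    Y = X @ W.T + w0
    return np.argmax(Y, axis=1)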

Page 7: Approaches for Learning Parameters for Linear Discriminant Functions

Least square method

Fisher's linear discriminant
– Relation to least squares
– Multiple classes

Perceptron algorithm

Page 8: Least Square Method

Minimization of the sum-of-squares error (SSE), with a 1-of-K binary coding scheme for the target vector t.

y(x) = \tilde{W}^T \tilde{x}, where \tilde{W} = [\tilde{w}_1 ... \tilde{w}_K] and \tilde{w}_k = (w_{k0}, w_k^T)^T.

For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is

E_D(\tilde{W}) = (1/2) Tr{ (\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T) },

where the n-th rows of \tilde{X} and T are \tilde{x}_n^T and t_n^T, respectively.

Minimizing the SSE gives

\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T,  where \tilde{X}^\dagger is the pseudo-inverse.
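A minimal numpy sketch of this closed-form solution, assuming 1-of-K target rows in T and a prepended bias feature; np.linalg.lstsq is used here in place of explicitly forming the pseudo-inverse (a numerical-stability choice, not something the slides prescribe):

import numpy as np

def least_squares_fit(X, T):
    # X: (N, D) inputs, T: (N, K) 1-of-K target vectors.
    # Returns the augmented weight matrix W_tilde of shape (D+1, K).
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the bias feature
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None) # minimum-norm least-squares solution
    return W_tilde

def least_squares_predict(X, W_tilde):
    # Assign each input to the class whose linear output is largest.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)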

Page 9: Least Square Method (Cont'd) - Limit and Disadvantage

The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure the outputs to lie in the range [0, 1].

Vulnerable to outliers, because the SSE function also penalizes 'too correct' examples, i.e. points far from the decision boundary on the correct side.

Least squares is ML under a Gaussian conditional distribution: unimodal, whereas the binary targets are multimodal.

Page 10: Least Square Method (Cont'd) - Limit and Disadvantage

Lack of robustness comes from the fact that the least square method corresponds to maximum likelihood under the assumption of a Gaussian distribution, while binary target vectors are far from this assumption.

[Figure: least-squares solution vs. logistic regression decision boundaries]

Page 11: Fisher's Linear Discriminant

Linear classification model viewed as dimensionality reduction from the D-dimensional input space to one dimension. In the case of two classes:

Finding w such that the projected data are clustered well.

y = w^T x; if y >= -w_0, then assign x to C_1; otherwise to C_2.

Page 12: Fisher's Linear Discriminant (Cont'd)

Maximizing the projected mean distance? The distance between the cluster means, m_1 and m_2, projected onto w.

Not appropriate when the covariances are nondiagonal.

m_1 = (1/N_1) \sum_{n \in C_1} x_n   and   m_2 = (1/N_2) \sum_{n \in C_2} x_n

m_2' - m_1' = w^T (m_2 - m_1),   where m_k' = w^T m_k is the projected mean of class k.

Page 13: Fisher's Linear Discriminant (Cont'd)

Incorporate the within-class variance of the projected data, and find the w that maximizes J(w):

J(w) = (m_2' - m_1')^2 / (s_1^2 + s_2^2),   where s_k^2 = \sum_{n \in C_k} (y_n - m_k')^2,

or equivalently

J(w) = (w^T S_B w) / (w^T S_W w),

with

S_B = (m_2 - m_1)(m_2 - m_1)^T   (between-class covariance matrix)
S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T   (within-class covariance matrix)

J(w) is maximized when (w^T S_B w) S_W w = (w^T S_W w) S_B w, and S_B w is always in the direction of (m_2 - m_1), so

w \propto S_W^{-1} (m_2 - m_1)   (Fisher's linear discriminant).

If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
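A minimal sketch of the two-class Fisher direction w \propto S_W^{-1}(m_2 - m_1); the projected-midpoint threshold used for classification below is an illustrative assumption, not part of the slides:

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: (N1, D) and (N2, D) samples of classes C_1 and C_2.
    # Returns w proportional to S_W^{-1}(m_2 - m_1).
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class covariance
    return np.linalg.solve(S_W, m2 - m1)

def fisher_classify(x, w, m1, m2):
    # Illustrative threshold: midpoint of the two projected class means.
    threshold = 0.5 * (w @ m1 + w @ m2)
    return 2 if w @ x >= threshold else 1   # C_2 lies on the m_2 side of the projection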

Page 14: Fisher's Linear Discriminant - Relation to Least Squares

Fisher criterion as a special case of least squares.

When setting the target values as N/N_1 for class C_1 and -N/N_2 for class C_2, the sum-of-squares error is

E = (1/2) \sum_{n=1}^N (w^T x_n + w_0 - t_n)^2.

Setting the derivatives to zero:

dE/dw_0 = 0:  \sum_{n=1}^N (w^T x_n + w_0 - t_n) = 0    (1)
dE/dw = 0:   \sum_{n=1}^N (w^T x_n + w_0 - t_n) x_n = 0  (2)

By solving (1): w_0 = -w^T m, where m = (1/N) \sum_n x_n = (1/N)(N_1 m_1 + N_2 m_2).

By solving (2) with the above: (S_W + (N_1 N_2 / N) S_B) w = N(m_1 - m_2).

Since S_B w is always in the direction of (m_2 - m_1), we obtain w \propto S_W^{-1}(m_2 - m_1).

Page 15: Fisher's Discriminant for Multiple Classes

K > 2 classes: dimension reduction from D to D'.

D' > 1 linear features, y_k = w_k^T x (k = 1, ..., D'), i.e. y = W^T x.

Generalization of S_W and S_B:

S_W = \sum_{k=1}^K S_k,   S_k = \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^T,   m_k = (1/N_k) \sum_{n \in C_k} x_n

S_T = \sum_{n=1}^N (x_n - m)(x_n - m)^T,   m = (1/N) \sum_{n=1}^N x_n = (1/N) \sum_{k=1}^K N_k m_k

S_T = S_W + S_B,   so   S_B = \sum_{k=1}^K N_k (m_k - m)(m_k - m)^T.

S_B comes from the decomposition of the total covariance matrix (Duda and Hart, 1997).

Page 16: Fisher's Discriminant for Multiple Classes (Cont'd)

Covariance matrices in the projected y-space:

s_W = \sum_{k=1}^K \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T   and   s_B = \sum_{k=1}^K N_k (\mu_k - \mu)(\mu_k - \mu)^T,

where \mu_k = (1/N_k) \sum_{n \in C_k} y_n and \mu = (1/N) \sum_{k=1}^K N_k \mu_k.

Fukunaga's criterion:

J(W) = Tr{ s_W^{-1} s_B } = Tr{ (W S_W W^T)^{-1} (W S_B W^T) }

Another criterion (Duda et al., 'Pattern Classification', Ch. 3.8.3):

J(W) = |s_B| / |s_W| = |W S_B W^T| / |W S_W W^T|

Determinant: the product of the eigenvalues, i.e. the variances in the principal directions.
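One standard way to maximize such criteria is to take the leading eigenvectors of S_W^{-1} S_B as the projection directions; this is a well-known result rather than something worked out on the slide, and the names below are illustrative:

import numpy as np

def fisher_multiclass_projection(X, labels, d_prime):
    # X: (N, D) data, labels: (N,) class indices 0..K-1.
    # Returns W of shape (d_prime, D) whose rows are leading eigenvectors of S_W^{-1} S_B.
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)
        S_B += len(X_k) * np.outer(m_k - m, m_k - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)            # largest eigenvalues first
    return eigvecs[:, order[:d_prime]].real.T    # project with y = W @ x

Since S_B has rank at most K-1, at most K-1 of these directions carry discriminative information.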

Page 17: Fisher's Discriminant for Multiple Classes (Cont'd)

Page 18: Perceptron Algorithm

Classification of x by a perceptron:

y(x) = f(w^T x), where f(a) = +1 for a >= 0 and f(a) = -1 for a < 0.

Error functions: the total number of misclassified patterns is piecewise constant and discontinuous; its gradient is zero almost everywhere.

Perceptron criterion:

E_P(w) = -\sum_{n \in M} w^T x_n t_n, where t_n in {-1, +1} is the target output and M is the set of misclassified patterns.

Page 19: Perceptron Algorithm (Cont'd)

Stochastic gradient descent algorithm:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta x_n t_n

The error from a misclassified pattern is reduced after each iteration:

-w^{(\tau+1)T} x_n t_n = -w^{(\tau)T} x_n t_n - (x_n t_n)^T (x_n t_n) < -w^{(\tau)T} x_n t_n.

This does not imply that the overall error is reduced.

Perceptron convergence theorem: if there exists an exact solution (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it. However, issues remain: learning speed, linearly nonseparable data, and extension to multiple classes.
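A minimal sketch of this stochastic update, assuming targets coded as -1/+1 and a fixed learning rate eta (the epoch loop and names are illustrative):

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    # X: (N, D) inputs, t: (N,) targets in {-1, +1}.
    # Returns the augmented weight vector (bias first).
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # absorb the bias into w
    w = np.zeros(X_tilde.shape[1])
    for _ in range(max_epochs):
        n_errors = 0
        for x_n, t_n in zip(X_tilde, t):
            if t_n * (w @ x_n) <= 0:        # pattern is misclassified
                w += eta * x_n * t_n        # w <- w + eta * x_n * t_n
                n_errors += 1
        if n_errors == 0:                   # all patterns correct: stop
            break
    return w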

Page 20: Perceptron Algorithm (Cont'd)

[Figure: panels (a), (b), (c), (d)]

Page 21: Probabilistic Generative Models

Computation of posterior probabilities using class-conditional densities and class priors.

Two classes:

p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ) = 1 / (1 + exp(-a)) = \sigma(a),

where a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ].

Generalization to K > 2 classes:

p(C_k|x) = p(x|C_k) p(C_k) / \sum_j p(x|C_j) p(C_j) = exp(a_k) / \sum_j exp(a_j),

where a_k = ln p(x|C_k) p(C_k).

The normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function.
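A minimal sketch of these two relations, computing posteriors from the quantities a_k = ln p(x|C_k) p(C_k); the max-subtraction is a standard numerical-stability trick, not part of the slides:

import numpy as np

def posterior_two_class(log_joint_1, log_joint_2):
    # log_joint_k = ln p(x|C_k) + ln p(C_k); returns p(C_1|x) via the logistic sigmoid.
    a = log_joint_1 - log_joint_2
    return 1.0 / (1.0 + np.exp(-a))

def posterior_softmax(log_joints):
    # log_joints: array of a_k for k = 1..K; returns p(C_k|x) via the softmax.
    a = np.asarray(log_joints, dtype=float)
    a = a - a.max()            # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()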

Page 22: Probabilistic Generative Models - Continuous Inputs

Posterior probabilities when the class-conditional densities are Gaussian and share the same covariance matrix \Sigma:

p(x|C_k) = (1 / ((2\pi)^{D/2} |\Sigma|^{1/2})) exp{ -(1/2) (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) }.

Two classes:

p(C_1|x) = \sigma(w^T x + w_0),

where w = \Sigma^{-1}(\mu_1 - \mu_2) and w_0 = -(1/2) \mu_1^T \Sigma^{-1} \mu_1 + (1/2) \mu_2^T \Sigma^{-1} \mu_2 + ln( p(C_1) / p(C_2) ).

The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space. The priors enter only through w_0, i.e. they shift the decision boundary in parallel.
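A minimal sketch of these formulas for the two-class case (parameter names are illustrative):

import numpy as np

def gaussian_shared_cov_two_class(mu1, mu2, Sigma, prior1, prior2):
    # Returns (w, w0) such that p(C_1|x) = sigma(w^T x + w_0) when both
    # class-conditionals are Gaussian with the shared covariance matrix Sigma.
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

def posterior_c1(x, w, w0):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # logistic sigmoid of the linear form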

Page 23: Probabilistic Generative Models - Continuous Inputs (Cont'd)

Generalization to K classes

When sharing the same covariance matrix, the decision boundaries are linear again:

a_k(x) = w_k^T x + w_{k0},   where w_k = \Sigma^{-1} \mu_k and w_{k0} = -(1/2) \mu_k^T \Sigma^{-1} \mu_k + ln p(C_k).

If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.

Page 24: Probabilistic Generative Models - Maximum Likelihood Solution

Determining the parameters of p(x|C_k) and p(C_k) using maximum likelihood from a training data set.

Two classes:

Data set {x_n, t_n}, n = 1, ..., N, with t_n = 1 denoting C_1 and t_n = 0 denoting C_2, and t = (t_1, ..., t_N)^T.
Priors: p(C_1) = \pi and p(C_2) = 1 - \pi.

p(x_n, C_1) = p(C_1) p(x_n|C_1) = \pi N(x_n|\mu_1, \Sigma)
p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 - \pi) N(x_n|\mu_2, \Sigma)

The likelihood function:

p(t|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N [ \pi N(x_n|\mu_1, \Sigma) ]^{t_n} [ (1 - \pi) N(x_n|\mu_2, \Sigma) ]^{1 - t_n}

Page 25: Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

Two classes (cont'd)

Maximization of the likelihood with respect to \pi: the terms of the log likelihood that depend on \pi are

\sum_{n=1}^N { t_n ln \pi + (1 - t_n) ln(1 - \pi) }.

Setting the derivative with respect to \pi equal to zero gives

\pi = (1/N) \sum_{n=1}^N t_n = N_1 / N = N_1 / (N_1 + N_2).

Maximization with respect to \mu_1: the relevant terms are

\sum_{n=1}^N t_n ln N(x_n|\mu_1, \Sigma) = -(1/2) \sum_{n=1}^N t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + const,

which gives

\mu_1 = (1/N_1) \sum_{n=1}^N t_n x_n,   and analogously   \mu_2 = (1/N_2) \sum_{n=1}^N (1 - t_n) x_n.

Page 26: Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)

Two classes (cont'd)

Maximization of the likelihood with respect to the shared covariance matrix \Sigma: the relevant terms of the log likelihood are

-(1/2) \sum_n t_n ln|\Sigma| - (1/2) \sum_n t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1)
-(1/2) \sum_n (1 - t_n) ln|\Sigma| - (1/2) \sum_n (1 - t_n) (x_n - \mu_2)^T \Sigma^{-1} (x_n - \mu_2)
= -(N/2) ln|\Sigma| - (N/2) Tr{ \Sigma^{-1} S },

where

S = (N_1/N) S_1 + (N_2/N) S_2,   S_k = (1/N_k) \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.

Maximizing gives \Sigma = S: a weighted average of the covariance matrices associated with each of the two classes.

But the result is not robust to outliers.
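A minimal numpy sketch of these closed-form ML estimates (pi, mu_1, mu_2, and the shared covariance S); array names are illustrative:

import numpy as np

def fit_gaussian_generative(X, t):
    # X: (N, D) inputs, t: (N,) binary targets (1 for C_1, 0 for C_2).
    # Returns the ML estimates pi, mu1, mu2 and the shared covariance S.
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    X1, X2 = X[t == 1], X[t == 0]
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    S = (N1 / N) * S1 + (N2 / N) * S2    # weighted average of the per-class covariances
    return pi, mu1, mu2, S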

Page 27: Probabilistic Generative Models - Discrete Features

Discrete feature values x_i in {0, 1}: a general distribution would correspond to a table of 2^D entries.

When we have D inputs, the table size grows exponentially with the number of features.

Naïve Bayes assumption, conditioned on the class C_k:

p(x|C_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

a_k(x) = ln p(x|C_k) p(C_k) = \sum_{i=1}^D { x_i ln \mu_{ki} + (1 - x_i) ln(1 - \mu_{ki}) } + ln p(C_k)

Linear with respect to the features, as in the continuous case.
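A minimal sketch of the resulting linear function a_k(x) for binary features under the naive Bayes assumption; the clipping of mu is a guard against log(0) and is an implementation choice, not part of the slides:

import numpy as np

def naive_bayes_log_joint(x, mu, log_prior):
    # x: (D,) binary feature vector, mu: (K, D) Bernoulli parameters mu_ki,
    # log_prior: (K,) values ln p(C_k).  Returns a_k(x) for every class k.
    mu = np.clip(mu, 1e-12, 1 - 1e-12)    # avoid log(0)
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior

def naive_bayes_predict(x, mu, log_prior):
    # Class prediction: argmax_k a_k(x); posteriors follow from the softmax of a_k(x).
    return int(np.argmax(naive_bayes_log_joint(x, mu, log_prior)))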

Page 28: Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42)

Page 29: Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43)

Page 30: Probabilistic Generative Models - Exponential Family

For both Gaussian distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

This generalizes to class-conditional densities of the exponential family, restricted to the subclass for which u(x) = x.

The activations are again linear with respect to x.

Exponential family:

p(x|\lambda_k) = h(x) g(\lambda_k) exp{ \lambda_k^T u(x) }

For the subclass u(x) = x, and introducing a scaling parameter s:

p(x|\lambda_k, s) = (1/s) h(x/s) g(\lambda_k) exp{ (1/s) \lambda_k^T x }

Two classes:

p(C_1|x) = \sigma(a(x)),   where a(x) = (\lambda_1 - \lambda_2)^T x + ln g(\lambda_1) - ln g(\lambda_2) + ln p(C_1) - ln p(C_2).

K classes:

p(C_k|x) = exp(a_k) / \sum_j exp(a_j),   where a_k(x) = \lambda_k^T x + ln g(\lambda_k) + ln p(C_k).