Chapter 4: Linear Models for Classification
Pattern Recognition and Machine Learning
Transcript of shirani/vision/prml-ch4.pdf

Page 1

Chapter 4: Linear Models for Classification

Pattern Recognition and Machine Learning

Page 2

Representing the target values for classification

•  If there are only two classes, we typically use a single real-valued output that has target values of 1 for the "positive" class and 0 (or sometimes -1) for the other class.

•  For probabilistic class labels, the output of the model can represent the probability the model gives to the positive class.

•  If there are N classes, we often use a vector of N target values containing a single 1 for the correct class and zeros elsewhere (a 1-of-N, or "one-hot", coding; a short sketch follows this list).

•  For probabilistic labels, we can then use a vector of class probabilities as the target vector.
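A minimal sketch (my own illustration, not from the slides) of the 1-of-N target encoding described above, assuming integer class labels 0..N-1:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as 1-of-N target vectors."""
    targets = np.zeros((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

# e.g. classes 2, 0, 1 with N = 3 -> rows [0,0,1], [1,0,0], [0,1,0]
T = one_hot(np.array([2, 0, 1]), num_classes=3)
```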

 

Page 3

Three approaches to classification

•  Use discriminant functions directly, without probabilities:

•  Convert the input vector into one or more real values so that a simple operation (like thresholding) can be applied to get the class.

•  Infer conditional class probabilities:
•  Then make a decision that minimizes some loss function.

•  Compare the probability of the input under separate, class-specific, generative models.
•  E.g. fit a multivariate Gaussian to the input vectors of each class and see which Gaussian makes a test data vector most probable (a sketch of this follows below).

 

$p(\text{class} = C_k \mid \mathbf{x})$
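A minimal sketch (my own, not from the slides) of the generative approach in the last bullet: fit a multivariate Gaussian to each class and assign a test point to the class whose Gaussian, weighted by its prior, makes it most probable.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y, num_classes):
    """Fit one multivariate Gaussian per class plus a class prior."""
    models = []
    for k in range(num_classes):
        Xk = X[y == k]
        models.append({
            "prior": len(Xk) / len(X),
            "mean": Xk.mean(axis=0),
            "cov": np.cov(Xk, rowvar=False),
        })
    return models

def predict(models, x):
    """Pick the class whose Gaussian times its prior scores x highest."""
    scores = [m["prior"] * multivariate_normal.pdf(x, m["mean"], m["cov"])
              for m in models]
    return int(np.argmax(scores))
```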

Page 4

Discriminant functions

The planar decision surface in data-space for the simple linear discriminant function:

$$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0 \ge 0$$
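For reference (this elaboration is mine, following PRML Section 4.1.1): $\mathbf{w}$ is orthogonal to the decision surface and $w_0$ sets its offset from the origin; the signed distance of a point $\mathbf{x}$ from the surface is

$$\frac{y(\mathbf{x})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0}{\|\mathbf{w}\|}$$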

Page 5

Discriminant functions for N > 2 classes

•  One possibility is to use N two-way discriminant functions.
•  Each function discriminates one class from the rest.

•  Another possibility is to use N(N-1)/2 two-way discriminant functions.
•  Each function discriminates between two particular classes.

•  Both  these  methods  have  problems    

Page 6

Problems with multi-class discriminant functions

Page 7

A simple solution

•  Use K discriminant functions, and pick the max.
–  This is guaranteed to give consistent and convex decision regions if y is linear.

Given linear discriminant functions $y_i, y_j, y_k, \dots$, if

$$y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A) \quad \text{and} \quad y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$$

then, for positive $\alpha$ (with $0 \le \alpha \le 1$), linearity implies that

$$y_k\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big) > y_j\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big)$$
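A minimal sketch (assumed names, not from the slides) of "pick the max" over K linear discriminant functions $y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}$:

```python
import numpy as np

def predict_class(X, W, w0):
    """K linear discriminants y_k(x) = w_k . x + w0_k; pick the max.

    X: (N, D) inputs, W: (K, D) weight vectors, w0: (K,) biases.
    """
    Y = X @ W.T + w0          # (N, K) discriminant values
    return np.argmax(Y, axis=1)
```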

Page 8

Using "least squares" for classification

This is not the right thing to do, and it doesn't work as well as other methods (as we will see later), but it is easy. It reduces classification to least-squares regression. We already know how to do regression, so we can just solve for the optimal weights with some matrix algebra.

 

Page 9

Least squares for classification

The sum-of-squares error function for training data $\{\mathbf{x}_n, \mathbf{t}_n\}$, where the $n$th row of the target matrix $\mathbf{T}$ is $\mathbf{t}_n^{\mathrm{T}}$.
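For reference (the slide's formulas were not captured in the transcript; these are PRML Eqs. 4.15-4.17 quoted from memory), the error and its pseudo-inverse solution are

$$E_D(\tilde{\mathbf{W}}) = \tfrac{1}{2}\,\mathrm{Tr}\big\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}}(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\big\}, \qquad \tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^{\mathrm{T}}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^{\mathrm{T}}\mathbf{T} = \tilde{\mathbf{X}}^{\dagger}\mathbf{T}$$

A minimal sketch (my own code) of this least-squares classifier:

```python
import numpy as np

def least_squares_classifier(X, T):
    """Solve for the weight matrix with the pseudo-inverse (one column per class)."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias feature
    W_tilde = np.linalg.pinv(X_tilde) @ T            # W = X^+ T
    return W_tilde

def classify(W_tilde, X):
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)      # pick the largest output
```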

Page 10

Problems with using least squares for classification

If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".

[Figure: decision boundaries from least squares regression vs. logistic regression.]

Page 11

Another  example  where  least  squares  regression  gives  poor  decision  surfaces  

Page 12

Fisher’s  linear  discriminant  

•  A simple linear discriminant function is a projection of the data down to 1-D.
•  So choose the projection that gives the best separation of the classes. What do we mean by "best separation"?
•  An obvious direction to choose is the direction of the line joining the class means.
•  But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation.
•  Fisher's method chooses the direction that maximizes the ratio of between-class variance to within-class variance.
•  This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Page 13

Fisher’s  linear  discriminant  

The two-class linear discriminant acts as a projection:

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{x}$$

followed by a threshold: assign to $C_1$ if $y \ge -w_0$.

In which direction $\mathbf{w}$ should we project? One which separates the classes "well".
Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre.

Page 14

Fisher’s  linear  discriminant  

A natural idea would be to project in the direction of the line connecting the class means. However, this is problematic if the classes have variance in this direction (i.e. are not clustered around their means). Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within).

Page 15

A  picture  showing  the  advantage  of  Fisher’s  linear  discriminant.  

When projected onto the line joining the class means, the classes are not well separated.

Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Page 16

Math  of  Fisher’s  linear  discriminants  

What linear transformation is best for discrimination? The projection onto the vector separating the class means seems sensible. But we also want small variance within each class:

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{x}$$

$$s_1^2 = \sum_{n \in C_1} (y_n - m_1)^2, \qquad s_2^2 = \sum_{n \in C_2} (y_n - m_2)^2$$

Fisher's objective function is:

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} \qquad \text{(between-class separation over within-class variance)}$$

Page 17

More math of Fisher's linear discriminants

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^{\mathrm{T}}\mathbf{S}_W\,\mathbf{w}}$$

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\mathrm{T}}$$

$$\mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^{\mathrm{T}} + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^{\mathrm{T}}$$

Optimal solution:

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
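A minimal numerical sketch (my own, under the two-class setup above) of computing the Fisher direction $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2: (N1, D) and (N2, D) arrays of input vectors for each class.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                         # w prop. to S_W^{-1}(m2 - m1)
    return w / np.linalg.norm(w)

# project data with y = w . x, then choose a threshold on y
```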

Page 18

Probabilistic Generative Models

Model the class-conditional densities $p(\mathbf{x}|C_k)$ and the class priors $p(C_k)$, then use Bayes' theorem to compute the posterior probabilities $p(C_k|\mathbf{x})$.
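For reference (PRML Eqs. 4.57-4.58, reproduced from memory rather than the slide), the two-class posterior can be written as a logistic sigmoid of a log-odds quantity $a$:

$$p(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_1)\,p(C_1) + p(\mathbf{x}|C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}$$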

Page 19

Logistic Sigmoid Function
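The slide presumably plots the logistic sigmoid; its definition and basic properties (standard facts, not transcribed from the slide) are

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \sigma(-a) = 1 - \sigma(a), \qquad a = \ln\!\left(\frac{\sigma}{1-\sigma}\right) \ \text{(the logit)}$$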

Page 20

Probabilistic Generative Models (K classes)

•  Bayes' theorem

•  If $a_k \gg a_j$ for all $j \neq k$, then $p(C_k|\mathbf{x}) \approx 1$ and $p(C_j|\mathbf{x}) \approx 0$.
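For reference (PRML Eqs. 4.62-4.63, from memory), the K-class posterior is the normalized exponential (softmax) of the quantities $a_k$:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x}|C_k)\,p(C_k)\big)$$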

Page 21

Continuous inputs

Model the class-conditional densities as Gaussians; here, all classes share the same covariance matrix. Two-class problem.
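For reference (PRML Eqs. 4.64-4.67, from memory): with Gaussian class-conditionals sharing covariance $\boldsymbol{\Sigma}$, the posterior is a logistic sigmoid of a linear function of $\mathbf{x}$:

$$p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}), \qquad p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0)$$

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$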

Page 22

Two-Class Problem

Page 23

The  posterior  when  the  covariance  matrices  are  different  for  different  classes.  

The decision surface is planar when the covariance matrices are the same and quadratic when they are not.

Page 24

Maximum likelihood solution

•  The likelihood function:
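For reference (the formula itself was not captured in the transcript; PRML Eq. 4.71 and its maximizers, from memory): with labels $t_n \in \{0, 1\}$, prior $p(C_1) = \pi$, and shared covariance $\boldsymbol{\Sigma}$,

$$p(\mathbf{t}, \mathbf{X}\,|\,\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}\big[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\boldsymbol{\Sigma})\big]^{t_n}\big[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\boldsymbol{\Sigma})\big]^{1-t_n}$$

which is maximized by $\pi = N_1/N$, $\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_n t_n \mathbf{x}_n$, $\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_n (1-t_n)\mathbf{x}_n$, and $\boldsymbol{\Sigma}$ equal to the weighted average of the two class covariance matrices.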

       

 

Page 25

Maximum likelihood solution

Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust.

Page 26

Discrete features

When the features are binary, $x_i \in \{0, 1\}$, and a naive Bayes assumption is made, the class-conditional distributions factorize over the features, and the resulting quantities $a_k(\mathbf{x})$ are linear functions of the input values.
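For reference (PRML Eqs. 4.81-4.82, from memory):

$$p(\mathbf{x}|C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}, \qquad a_k(\mathbf{x}) = \sum_{i=1}^{D}\big\{x_i \ln \mu_{ki} + (1-x_i)\ln(1-\mu_{ki})\big\} + \ln p(C_k)$$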

Page 27

Probabilistic Discriminative Models

What is it? Use the functional form of the generalized linear model for $p(C_k|\mathbf{x})$ explicitly, and determine its parameters directly.

How it differs from discriminant functions and from probabilistic generative models.


Page 28


Fixed Basis Functions

The input space is transformed by a fixed set of nonlinear basis functions $\boldsymbol{\phi}(\mathbf{x})$; boundaries that are linear in feature space correspond to nonlinear boundaries in the original input space.
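A minimal sketch (my own, with assumed Gaussian basis functions) of mapping inputs through fixed nonlinear basis functions before applying any of the linear classifiers above:

```python
import numpy as np

def gaussian_basis(X, centres, s=1.0):
    """Map each input x to features phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

# Phi = gaussian_basis(X, centres); then fit a linear model on Phi instead of X.
```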

Page 29


Logistic Regression

•  To reduce the complexity of probabilistic models, the logistic sigmoid is applied directly to a linear function of the features (logistic regression):
–  For Gaussian class-conditional densities: M(M+5)/2 + 1 parameters
–  Logistic regression: M parameters
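The generative-model count follows from its parts (my own expansion of the slide's figure): two M-dimensional means, a shared M x M symmetric covariance, and one prior:

$$2M + \frac{M(M+1)}{2} + 1 = \frac{M^2 + 5M}{2} + 1 = \frac{M(M+5)}{2} + 1$$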

Page 30


The Likelihood

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,\{1 - y_n\}^{1 - t_n}, \qquad y_n = p(C_1|\boldsymbol{\phi}_n)$$

Page 31


The Gradient of the Log Likelihood

•  Error function:

•  Gradient of the Error function
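For reference (the slide's formulas were not captured; PRML Eqs. 4.90-4.91, from memory), the error function is the negative log likelihood (cross-entropy), and its gradient takes a simple form:

$$E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big\}, \qquad \nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\boldsymbol{\phi}_n$$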

Page 32


Iterative Reweighted Least Squares

•  Newton-Raphson
•  H is the Hessian matrix: $\mathbf{H} = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\,\boldsymbol{\Phi}$
•  R is an N x N diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$

Page 33


Iterative Reweighted Least Squares

•  The Newton-Raphson update takes this form:

$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{\text{(old)}} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$$
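The update itself was not captured in the transcript; from memory (PRML Eq. 4.99) it is $\mathbf{w}^{\text{(new)}} = (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\mathbf{z}$. A minimal IRLS sketch under these definitions (my own code, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_regression(Phi, t, num_iters=20):
    """Fit two-class logistic regression by iterative reweighted least squares.

    Phi: (N, M) design matrix of basis-function values; t: (N,) targets in {0, 1}.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(num_iters):
        y = sigmoid(Phi @ w)                      # y_n = sigma(w . phi_n)
        r = y * (1.0 - y)                         # diagonal of R (guard r -> 0 in practice)
        z = Phi @ w - (y - t) / r                 # z = Phi w - R^{-1}(y - t)
        A = Phi.T @ (Phi * r[:, None])            # Phi^T R Phi
        w = np.linalg.solve(A, Phi.T @ (r * z))   # Newton-Raphson / IRLS step
    return w
```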

Page 34


Multiclass Logistic Regression

•  For generative models (discussed in Section 4.2), the posteriors take the softmax form $p(C_k|\boldsymbol{\phi}) = \exp(a_k)/\sum_j \exp(a_j)$, where $a_k = \mathbf{w}_k^{\mathrm{T}}\boldsymbol{\phi}$.
•  Here we use maximum likelihood to determine the parameters $\{\mathbf{w}_k\}$ directly.
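For reference (PRML Eqs. 4.107-4.109, from memory), the cross-entropy error for 1-of-K targets and its gradient are

$$E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}, \qquad \nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,\boldsymbol{\phi}_n$$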

Page 35


Multiclass Logistic Regression

•  Maximum likelihood method

•  Use the IRLS algorithm to update the weight vectors $\mathbf{w}_k$.
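A minimal sketch (my own; plain gradient descent stands in for the IRLS/Newton update for brevity) of fitting multiclass logistic regression with the gradient given above:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)          # for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def fit_multiclass_logistic(Phi, T, lr=0.1, num_iters=500):
    """Phi: (N, M) design matrix; T: (N, K) one-hot targets."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(num_iters):
        Y = softmax(Phi @ W)                      # y_nk = p(C_k | phi_n)
        grad = Phi.T @ (Y - T)                    # gradient of the cross-entropy error
        W -= lr * grad
    return W
```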