Linear Models for Classiﬁcation - Simon Fraser...

Discriminant Functions Generative Models Discriminative Models

Linear Models for ClassificationOliver Schulte - CMPT 726

Bishop PRML Ch. 4

Classification: Hand-written Digit Recognition

!"#$%&'"()#*$#+ !,- )( !.,$#/+ +%((/#/01/! 233456 334

7/#()#-! 8/#9 */:: )0 "'%! /$!9 +$"$;$!/ +,/ ") "'/ :$1< )(

8$#%$"%)0 %0 :%&'"%0& =>?@ 2ABC D,!" -$</! %" ($!"/#56E'/ 7#)")"97/ !/:/1"%)0 $:&)#%"'- %! %::,!"#$"/+ %0 F%&6 GH6

C! !//0I 8%/*! $#/ $::)1$"/+ -$%0:9 ()# -)#/ 1)-7:/J1$"/&)#%/! *%"' '%&' *%"'%0 1:$!! 8$#%$;%:%"96 E'/ 1,#8/-$#</+ 3BK7#)") %0 F%&6 L !')*! "'/ %-7#)8/+ 1:$!!%(%1$"%)07/#()#-$01/ ,!%0& "'%! 7#)")"97/ !/:/1"%)0 !"#$"/&9 %0!"/$+)( /.,$::9K!7$1/+ 8%/*!6 M)"/ "'$" */ );"$%0 $ >6? 7/#1/0"

/##)# #$"/ *%"' $0 $8/#$&/ )( )0:9 (),# "*)K+%-/0!%)0$:

8%/*! ()# /$1' "'#//K+%-/0!%)0$: );D/1"I "'$0<! ") "'/

(:/J%;%:%"9 7#)8%+/+ ;9 "'/ -$"1'%0& $:&)#%"'-6

!"# $%&'() *+,-. */0+12.33. 4,3,5,6.

N,# 0/J" /J7/#%-/0" %08):8/! "'/ OAPQKR !'$7/ !%:'),/""/

+$"$;$!/I !7/1%(%1$::9 B)#/ PJ7/#%-/0" BPK3'$7/KG 7$#" SI

*'%1' -/$!,#/! 7/#()#-$01/ )( !%-%:$#%"9K;$!/+ #/"#%/8$:

=>T@6 E'/ +$"$;$!/ 1)0!%!"! )( GI?HH %-$&/!U RH !'$7/

1$"/&)#%/!I >H %-$&/! 7/# 1$"/&)#96 E'/ 7/#()#-$01/ %!

-/$!,#/+ ,!%0& "'/ !)K1$::/+ V;,::!/9/ "/!"IW %0 *'%1' /$1'

!"# $%%% &'()*(+&$,)* ,) -(&&%') ()(./*$* ()0 1(+2$)% $)&%..$3%)+%4 5,.6 784 ),6 784 (-'$. 7997

:;<6 #6 (== >? @AB C;DE=FDD;?;BG 1)$*& @BD@ G;<;@D HD;I< >HJ CB@A>G KLM >H@ >? "94999N6 &AB @BO@ FP>QB BFEA G;<;@ ;IG;EF@BD @AB BOFCR=B IHCPBJ

?>==>SBG PT @AB @JHB =FPB= FIG @AB FDD;<IBG =FPB=6

:;<6 U6 M0 >PVBE@ JBE><I;@;>I HD;I< @AB +,$.W79 GF@F DB@6 +>CRFJ;D>I >?@BD@ DB@ BJJ>J ?>J **04 *AFRB 0;D@FIEB K*0N4 FIG *AFRB 0;D@FIEB S;@A!!"#$%&$' RJ>@>@TRBD K*0WRJ>@>N QBJDHD IHCPBJ >? RJ>@>@TRB Q;BSD6 :>J**0 FIG *04 SB QFJ;BG @AB IHCPBJ >? RJ>@>@TRBD HI;?>JC=T ?>J F==>PVBE@D6 :>J *0WRJ>@>4 @AB IHCPBJ >? RJ>@>@TRBD RBJ >PVBE@ GBRBIGBG >I@AB S;@A;IW>PVBE@ QFJ;F@;>I FD SB== FD @AB PB@SBBIW>PVBE@ D;C;=FJ;@T6

:;<6 "96 -J>@>@TRB Q;BSD DB=BE@BG ?>J @S> G;??BJBI@ M0 >PVBE@D ?J>C @AB+,$. GF@F DB@ HD;I< @AB F=<>J;@AC GBDEJ;PBG ;I *BE@;>I !676 X;@A @A;DFRRJ>FEA4 Q;BSD FJB F==>EF@BG FGFR@;QB=T GBRBIG;I< >I @AB Q;DHF=E>CR=BO;@T >? FI >PVBE@ S;@A JBDRBE@ @> Q;BS;I< FI<=B6

ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

• Each input vector classified into one of K discrete classes• Denote classes by Ck

• Represent input image as a vector xi ∈ R784.• We have target vector ti ∈ {0, 1}10

• Given a training set {(x1, t1), . . . , (xN , tN)}, learning problemis to construct a “good” function y(x) from these.• y : R784 → R10

Generalized Linear Models

• Similar to previous chapter on linear models for regression,we will use a “linear” model for classification:

y(x) = f (wTx + w0)

• This is called a generalized linear model• f (·) is a fixed non-linear function

• e.g.

f (u) ={

1 if u ≥ 00 otherwise

• Decision boundary between classes will be linear functionof x

• Can also apply non-linearity to x, as in φi(x) for regression

y(x) = f (wTx + w0)

• e.g.

f (u) ={

y(x) = f (wTx + w0)

• e.g.

f (u) ={

Overview

• Linear regression for Classification• The Fisher Linear Discriminant, or How to Draw a Line

Between Classes• The Perceptron, or The Smallest Neural Net• Logistic Regression—The Statistician’s Classifier

Outline

Discriminant Functions

Generative Models

Discriminative Models

Discriminant Functions with Two Classes

y(x)‖w‖

−w0‖w‖

y = 0y < 0

• Start with 2 class problem,ti ∈ {0, 1}

• Simple linear discriminant

y(x) = wTx + w0

apply threshold function to getclassification

• Decision surface is line;orthogonal to w.

• Projection of x in w dir. is wT x||w||

Multiple Classes

not C1

not C2

• A linear discriminant between two classes separates with ahyperplane

• How to use this for multiple classes?• One-versus-the-rest method: build K − 1 classifiers,

between Ck and all others• One-versus-one method: build K(K − 1)/2 classifiers,

between all pairs

Multiple Classes

not C1

not C2

between all pairs

Multiple Classes

not C1

not C2

between all pairs

Multiple Classes

not C1

not C2

between all pairs

Multiple Classes

not C1

not C2

between all pairs

Multiple Classes

not C1

not C2

between all pairs

Multiple Classes

• A solution is to build K linear functions:

yk(x) = wTk x + wk0

assign x to class maxk yk(x)• Gives connected, convex decision regions

x̂ = λxA + (1− λ)xB

yk(x̂) = λyk(xA) + (1− λ)yk(xB)

⇒ yk(x̂) > yj(x̂),∀j 6= k

Multiple Classes

• A solution is to build K linear functions:

yk(x) = wTk x + wk0

assign x to class maxk yk(x)• Gives connected, convex decision regions

x̂ = λxA + (1− λ)xB

yk(x̂) = λyk(xA) + (1− λ)yk(xB)

⇒ yk(x̂) > yj(x̂),∀j 6= k

Least Squares for Classification

• How do we learn the decision boundaries (wk,wk0)?• One approach is to use least squares, similar to regression• Find W to minimize squared error over all examples and all

components of the label vector:

E(W) =12

N∑n=1

K∑k=1

(yk(xn)− tnk)2

• Some algebra, we get a solution using the pseudo-inverseas in regression

E(W) =12

N∑n=1

K∑k=1

(yk(xn)− tnk)2

E(W) =12

N∑n=1

K∑k=1

(yk(xn)− tnk)2

Problems with Least Squares

−4 −2 0 2 4 6 8

• Looks okay... least squaresdecision boundary• Similar to logistic regression

decision boundary (more later)

• Gets worse by adding easypoints?!

• Why?• If target value is 1, points far

from boundary will have highvalue, say 10; this is a largeerror so the boundary ismoved

−4 −2 0 2 4 6 8

• Why?

• If target value is 1, points farfrom boundary will have highvalue, say 10; this is a largeerror so the boundary ismoved

−4 −2 0 2 4 6 8

Linear Models for Classiﬁcation - Simon Fraser...

Documents

Transcript of Linear Models for Classiﬁcation - Simon Fraser...