Andrew Rosenberg- Support Vector Machines (and Kernel Methods in general)


  • 8/3/2019 Andrew Rosenberg- Support Vector Machines (and Kernel Methods in general)

Support Vector Machines
(and Kernel Methods in general)

Machine Learning
March 23, 2010

Last Time

Multilayer Perceptron / Logistic Regression Networks
Neural Networks
Error Backpropagation

Today

Support Vector Machines. Note: we'll rely on some math from Optimality Theory that we won't derive.

Maximum Margin

The perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.

[Figure: several separating lines through the same data. Are these really equally valid?]

Max Margin

How can we pick which is best? Maximize the size of the margin.

[Figure: the same data separated with a small margin vs. a large margin.]

Support Vectors

Support vectors are those input points (vectors) closest to the decision boundary.

1. They are vectors.
2. They "support" the decision hyperplane.

Support Vectors

Define this as a decision problem. The decision hyperplane (no fancy math, just the equation of a hyperplane):

$w^T x + b = 0$

Support Vectors

Aside: why do some classifiers use $t_i \in \{0, 1\}$ and others $t_i \in \{-1, +1\}$? Simplicity of the math and interpretation. For probability density function estimation, $\{0, 1\}$ has a clear correlate. For classification, a decision boundary of 0 is more easily interpretable than 0.5.

Here the $x_i$ are the data and $t_i \in \{-1, +1\}$ are the labels.

Support Vectors

Define this as a decision problem. The decision hyperplane:

$w^T x + b = 0$

Decision function:

$D(x_i) = \mathrm{sign}(w^T x_i + b)$
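The decision function above is a one-liner in code. A minimal sketch (the hyperplane values here are hypothetical, chosen for illustration):

```python
import numpy as np

def decide(w, b, x):
    # D(x) = sign(w^T x + b)
    return int(np.sign(w @ x + b))

# Hypothetical hyperplane: w = (1, -1), b = 0, i.e. the boundary x1 = x2.
w, b = np.array([1.0, -1.0]), 0.0
print(decide(w, b, np.array([2.0, 0.0])))   # -> 1
print(decide(w, b, np.array([0.0, 2.0])))   # -> -1
```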

Support Vectors

The decision hyperplane:

$w^T x + b = 0$

Margin hyperplanes:

$w^T x + b = \epsilon$
$w^T x + b = -\epsilon$

Support Vectors

The decision hyperplane:

$w^T x + b = 0$

Scale invariance:

$cw^T x + cb = 0$

Support Vectors

The decision hyperplane:

$w^T x + b = 0$

Scale invariance:

$cw^T x + cb = 0$

Margin hyperplanes:

$w^T x + b = \epsilon$
$w^T x + b = -\epsilon$

Support Vectors

The decision hyperplane: $w^T x + b = 0$, with scale invariance $cw^T x + cb = 0$.

This scaling does not change the decision hyperplane or the support vector hyperplanes, but it will eliminate a variable from the optimization:

$w^T x + b = 1$
$w^T x + b = -1$

What are we optimizing?

We will represent the size of the margin in terms of $w$. This will allow us to simultaneously:

- Identify a decision boundary
- Maximize the margin

How do we represent the size of the margin in terms of w?

1. There must be at least one point that lies on each support hyperplane. (Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).)

[Figure: points $x_1$ and $x_2$, one on each support hyperplane.]


How do we represent the size of the margin in terms of w?

1. There must be at least one point that lies on each support hyperplane.
2. Thus: $w^T x_1 + b = 1$ and $w^T x_2 + b = -1$.
3. And: $w^T(x_1 - x_2) = 2$.


How do we represent the size of the margin in terms of w?

The vector $w$ is perpendicular to the decision hyperplane: if the dot product of two vectors equals zero, the two vectors are perpendicular.

[Figure: $w$ drawn as the normal to $w^T x + b = 0$; $w^T(x_1 - x_2) = 2$.]

How do we represent the size of the margin in terms of w?

The margin is the projection of $x_1 - x_2$ onto $w$, the normal of the hyperplane, where $w^T(x_1 - x_2) = 2$.

Aside: Vector Projection

We want the length ("goal") of the projection of $v$ onto $u$:

$\cos(\theta) = \dfrac{\text{adjacent}}{\text{hypotenuse}}$

$v \cdot u = \|v\|\,\|u\|\cos(\theta)$

Here hypotenuse $= \|v\|$ and adjacent $=$ goal, so $\cos(\theta) = \dfrac{\text{goal}}{\|v\|}$ and $\|v\|\cos(\theta) = \text{goal}$.

Therefore $\dfrac{v \cdot u}{\|u\|} = \|v\|\cos(\theta) = \text{goal}$.
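The projection identities above can be checked numerically. A quick sketch with hypothetical vectors:

```python
import numpy as np

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])

# Signed length of the projection of v onto u: (v . u) / ||u||
length = (v @ u) / np.linalg.norm(u)     # -> 3.0

# Projection vector: ((v . u) / (u . u)) * u
proj = ((v @ u) / (u @ u)) * u           # -> [3. 0.]
print(length, proj)
```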

How do we represent the size of the margin in terms of w?

The margin is the projection of $x_1 - x_2$ onto $w$, the normal of the hyperplane, where $w^T(x_1 - x_2) = 2$.

Projection: $\dfrac{w^T(x_1 - x_2)}{w \cdot w}\, w$

Size of the margin: $\dfrac{w^T(x_1 - x_2)}{\|w\|} = \dfrac{2}{\|w\|}$
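So once $w$ is known, the margin width falls out of its norm. A one-line check with a hypothetical weight vector:

```python
import numpy as np

# The distance between the two support hyperplanes is 2 / ||w||.
w = np.array([3.0, 4.0])          # hypothetical weight vector, ||w|| = 5
margin = 2.0 / np.linalg.norm(w)  # -> 0.4
print(margin)
```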


Max Margin Loss Function

$\min \|w\|$ where $t_i(w^T x_i + b) \ge 1$

If constrained optimization, then Lagrange multipliers. Optimize the primal:

$L(w, b) = \frac{1}{2} w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i\left((w \cdot x_i) + b\right) - 1 \right]$

Max Margin Loss Function

Optimize the primal. Partial w.r.t. $b$:

$\dfrac{\partial L(w, b)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=0}^{N-1} \alpha_i t_i = 0$

Max Margin Loss Function

Optimize the primal. Partial w.r.t. $w$:

$\dfrac{\partial L(w, b)}{\partial w} = 0 \;\Rightarrow\; w - \sum_{i=0}^{N-1} \alpha_i t_i x_i = 0 \;\Rightarrow\; w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$


Max Margin Loss Function

Construct the dual by substituting $w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$ back into $L(w, b)$:

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$.

Dual Formulation of the Error

Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights. There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java and R.
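For a feel of what those solvers compute, a full QP solver is overkill on a toy problem. The sketch below (all names hypothetical) maximizes the dual for a two-point data set: the constraint $\sum_i \alpha_i t_i = 0$ forces the two multipliers equal, so $W(\alpha)$ collapses to a 1-D concave function that plain projected gradient ascent can handle.

```python
import numpy as np

# Toy data set: x = +1 labelled +1, x = -1 labelled -1.
X = np.array([[1.0], [-1.0]])
t = np.array([1.0, -1.0])
K = X @ X.T                       # Gram matrix of dot products x_i . x_j

# With alpha_0 = alpha_1 = a, the dual is W(a) = 2a - 2a^2 for this data.
a = 0.0
for _ in range(200):
    a = max(0.0, a + 0.1 * (2.0 - 4.0 * a))   # ascend dW/da = 2 - 4a, keep a >= 0

alphas = np.array([a, a])
w = (alphas * t) @ X              # w = sum_i alpha_i t_i x_i
W = alphas.sum() - 0.5 * (alphas * t) @ K @ (alphas * t)
print(alphas, w, W)               # alphas -> [0.5 0.5], w -> [1.], W -> 0.5
```

The recovered $w = 1$, $b = 0$ puts the support hyperplanes exactly through the two points, as expected.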

Quadratic Programming

minimize $f(x) = \frac{1}{2} x^T Q x + c^T x$
subject to (one or more): $Ax \le k$, $Bx = l$

If $Q$ is positive semidefinite, then $f(x)$ is convex. If $f(x)$ is convex, then there is a single minimum. The dual $W(\alpha)$ fits this form.
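The convexity condition is easy to verify numerically. A minimal sketch with a hypothetical $Q$ and $c$:

```python
import numpy as np

# f(x) = 0.5 x^T Q x + c^T x is convex when Q is positive semidefinite,
# i.e. all eigenvalues of the symmetric matrix Q are >= 0.
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([-2.0, 0.0])

def f(x):
    return 0.5 * x @ Q @ x + c @ x

convex = bool(np.all(np.linalg.eigvalsh(Q) >= 0))
x_star = np.linalg.solve(Q, -c)   # unconstrained minimizer: Q x + c = 0
print(convex, x_star, f(x_star))  # -> True [1. 0.] -1.0
```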

Support Vector Expansion

When $\alpha_i$ is non-zero, $x_i$ is a support vector. When $\alpha_i$ is zero, $x_i$ is not a support vector.

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

$D(x) = \mathrm{sign}(w^T x + b) = \mathrm{sign}\left(\sum_{i=0}^{N-1} \alpha_i t_i x_i^T x + b\right) = \mathrm{sign}\left(\sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b\right)$

New decision function, independent of the dimension of $x$!
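In expanded form the decision function touches the test point only through dot products with the training points. A sketch, reusing the toy solution from the dual ($\alpha_i = 0.5$, $b = 0$; values hypothetical):

```python
import numpy as np

def svm_decide(alphas, t, X, b, x):
    # D(x) = sign( sum_i alpha_i t_i (x_i . x) + b )
    # x enters only via dot products with the training points X.
    return int(np.sign(np.sum(alphas * t * (X @ x)) + b))

alphas = np.array([0.5, 0.5])
t = np.array([1.0, -1.0])
X = np.array([[1.0], [-1.0]])
print(svm_decide(alphas, t, X, 0.0, np.array([2.0])))    # -> 1
print(svm_decide(alphas, t, X, 0.0, np.array([-2.0])))   # -> -1
```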

Kuhn-Tucker Conditions

In constrained optimization, at the optimal solution: constraint $\times$ Lagrange multiplier $= 0$.

$\alpha_i\left(1 - t_i(w^T x_i + b)\right) = 0$

If $\alpha_i \ne 0$, then $t_i(w^T x_i + b) = 1$.

Only points on the margin hyperplanes contribute to the solution!
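The complementarity condition can be checked point by point on a candidate solution. A sketch with hypothetical values (a third point off the margin gets $\alpha = 0$):

```python
import numpy as np

X = np.array([[1.0], [-1.0], [3.0]])
t = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.5, 0.5, 0.0])   # third point is not a support vector
w, b = np.array([1.0]), 0.0

slack = 1.0 - t * (X @ w + b)        # 1 - t_i (w^T x_i + b)
products = alphas * slack            # alpha_i * constraint, all ~0 at the optimum
print(products)
```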

Visualization of Support Vectors

[Figure: data with support vectors highlighted; points off the margin have $\alpha = 0$, points on it have $\alpha > 0$.]

Interpretability of SVM Parameters

What else can we tell from the alphas? If an alpha is large, then the associated data point is quite important: it's either an outlier, or incredibly important. But this only gives us the best solution for linearly separable data sets.

Basis of Kernel Methods

The decision process doesn't depend on the dimensionality of the data, so we can map to a higher-dimensional data space. Note: data points only appear within a dot product; the error is based on the dot product of data points, not the data points themselves.

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Basis of Kernel Methods

Since data points only appear within a dot product, we can map to another space through a replacement:

$x_i \cdot x_j \;\rightarrow\; \phi(x_i) \cdot \phi(x_j)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

becomes

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left(\phi(x_i) \cdot \phi(x_j)\right)$
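In code, the replacement is literally a swap of the dot product for a kernel evaluation. A sketch using an RBF kernel as the stand-in for $\phi(x_i)\cdot\phi(x_j)$ (the toy alphas and data are hypothetical):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Stands in for phi(x_i) . phi(x_j) under an implicit feature map phi.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_decide(alphas, t, X, b, x, kernel=rbf_kernel):
    # Same expansion as before, with the dot product swapped for the kernel.
    s = sum(a_i * t_i * kernel(x_i, x) for a_i, t_i, x_i in zip(alphas, t, X))
    return int(np.sign(s + b))

alphas = np.array([0.5, 0.5])
t = np.array([1.0, -1.0])
X = np.array([[1.0], [-1.0]])
print(kernel_decide(alphas, t, X, 0.0, np.array([1.0])))   # -> 1
```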

Learning Theory Bases of SVMs

Theoretical bounds on testing error: the upper bound doesn't depend on the dimensionality of the space, and the bound is tightened by maximizing the margin, $\gamma$, associated with the decision boundary.

Why We Like SVMs

- They work: good generalization.
- Easily interpreted: the decision boundary is based on the data, in the form of the support vectors. (Not so in multilayer perceptron networks.)
- Principled bounds on testing error from Learning Theory (VC dimension).

SVM vs. MLP

SVMs have many fewer parameters. SVM: maybe just a kernel parameter. MLP: number and arrangement of nodes, and $\eta$, the learning rate.

SVM: a convex optimization task. MLP: the likelihood is non-convex -- local minima.

$R(\theta) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{2} \left( y_n - g\left( \sum_k w_{kl}\, g\left( \sum_j w_{jk}\, g\left( \sum_i w_{ij} x_{n,i} \right) \right) \right) \right)^2$

Soft Margin Classification

There can be outliers on the other side of the decision boundary, or points leading to a small margin. Solution: introduce a penalty term into the constraint function.

$\min \|w\| + C \sum_{i=0}^{N-1} \xi_i$ where $t_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i\left((w \cdot x_i) + b\right) + \xi_i - 1 \right]$
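At the optimum the slack $\xi_i$ equals the hinge loss $\max(0,\, 1 - t_i(w^T x_i + b))$, so the penalized objective can be evaluated directly. A sketch (the data and $C$ are hypothetical):

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    # (1/2) w.w + C * sum_i xi_i, where xi_i is the hinge slack:
    # the cost paid by points inside or beyond the margin.
    slacks = np.maximum(0.0, 1.0 - t * (X @ w + b))
    return 0.5 * (w @ w) + C * np.sum(slacks)

X = np.array([[1.0], [-1.0], [0.5]])   # third point sits inside the margin
t = np.array([1.0, -1.0, 1.0])
print(soft_margin_objective(np.array([1.0]), 0.0, X, t, C=1.0))  # -> 1.0
```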

Soft Max Dual

$\min \|w\| + C \sum_{i=0}^{N-1} \xi_i$ where $t_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i\left((w \cdot x_i) + b\right) + \xi_i - 1 \right]$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} t_i t_j \alpha_i \alpha_j (x_i \cdot x_j)$

where $0 \le \alpha_i \le C$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$.

Still quadratic programming!

Soft Margin Example

Points are allowed within the margin, but a cost is introduced.

[Figure: points $x_1$, $x_2$ inside the margin with slacks $\xi_i$; the hinge loss.]

Probabilities from SVMs

Support Vector Machines are discriminant functions.

- Discriminant functions: $f(x) = c$
- Discriminative models: $f(x) = \arg\max_c\, p(c \mid x)$
- Generative models: $f(x) = \arg\max_c\, p(x \mid c)\, p(c) / p(x)$

No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

Not especially fast.

- Training: $O(n^3)$ -- quadratic programming efficiency.
- Evaluation: $O(n)$ -- need to evaluate against each support vector (potentially all $n$).

Good Bye

Next time: The Kernel Trick -> Kernel Methods, or: how can we use SVMs on data that are not linearly separable?