Logistic Regression - Penn Engineering

Transcript
Page 1

Logistic Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Page 2

Classification Based on Probability
• Instead of just predicting the class, give the probability of the instance being that class
  – i.e., learn $p(y \mid x)$
• Comparison to perceptron:
  – Perceptron doesn't produce a probability estimate
  – Perceptron (and other discriminative classifiers) are only interested in producing a discriminative model
• Recall that:
  $p(\text{event}) + p(\neg\text{event}) = 1$
  $0 \le p(\text{event}) \le 1$

Page 3

Logistic Regression
• Takes a probabilistic approach to learning discriminative functions (i.e., a classifier)
• $h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
  – Want $0 \le h_\theta(x) \le 1$
  – Can't just use linear regression with a threshold
• Logistic regression model:
  $h_\theta(x) = g(\theta^T x)$
  where $g(z) = \dfrac{1}{1 + e^{-z}}$ is the logistic/sigmoid function, so
  $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
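As a concrete reference, here is a minimal NumPy sketch of the hypothesis above; the helper names (`sigmoid`, `predict_proba`) and the example numbers are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X; X is assumed to include a bias column of 1s."""
    return sigmoid(X @ theta)

theta = np.array([-1.0, 0.5])            # [theta_0, theta_1]
X = np.array([[1.0, 0.0], [1.0, 4.0]])   # each row is [1, x_1]
print(predict_proba(theta, X))           # values lie strictly between 0 and 1
```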

Page 4

Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

Example: Cancer diagnosis from tumor size
  $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$
  $h_\theta(x) = 0.7$
  → Tell the patient that there is a 70% chance of the tumor being malignant

Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$
Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng

Page 5

Another Interpretation
• Equivalently, logistic regression assumes that
  $\log \dfrac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$
• In other words, logistic regression assumes that the log odds of $y = 1$ is a linear function of $x$

Side Note: the odds in favor of an event is the quantity $p / (1 - p)$, where $p$ is the probability of the event
  E.g., if I toss a fair die, what are the odds that I will roll a 6? (Answer: $(1/6)/(5/6) = 1/5$, i.e., 1 to 5.)

Based on slide by Xiaoli Fern
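This log-odds relationship can be checked numerically; a small sketch under arbitrary example parameters (not from the slides):

```python
import numpy as np

theta = np.array([-1.0, 0.5, 2.0])       # [theta_0, theta_1, theta_2]
x = np.array([1.0, 3.0, -0.5])           # [1, x_1, x_2]

p = 1.0 / (1.0 + np.exp(-(theta @ x)))   # p(y = 1 | x; theta)
log_odds = np.log(p / (1.0 - p))         # log[ p / (1 - p) ]

# The log odds recovers the linear function theta_0 + theta_1*x_1 + theta_2*x_2.
print(log_odds, theta @ x)               # both print -0.5 (up to floating-point error)
```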

Page 6

Logistic Regression
  $h_\theta(x) = g(\theta^T x)$, with $g(z) = \dfrac{1}{1 + e^{-z}}$
• Assume a threshold of 0.5 and ...
  – Predict $y = 1$ if $h_\theta(x) \ge 0.5$
  – Predict $y = 0$ if $h_\theta(x) < 0.5$
• In $h_\theta(x) = g(\theta^T x)$, $\theta^T x$ should be a large positive value for positive instances and a large negative value for negative instances (driving $h_\theta(x)$ toward 1 or 0, respectively)

Based on slide by Andrew Ng
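A minimal sketch of this decision rule (the function name and threshold argument are illustrative):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs >= threshold).astype(int)

# Since g(z) >= 0.5 exactly when z >= 0, thresholding at 0.5 is the same as
# checking the sign of theta^T x.
```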

Page 7

Non-Linear Decision Boundary
• Can apply basis function expansion to the features, same as with linear regression:
  $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$
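One possible hand-rolled expansion for two features, mirroring (a truncation of) the vector on the slide; the helper name is hypothetical:

```python
import numpy as np

def expand_features(x1, x2):
    """Expansion matching the slide's pattern:
    [1, x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2] (the slide continues with '...')."""
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2, x1 ** 2 * x2, x1 * x2 ** 2])

# Running logistic regression on the expanded features yields a decision boundary
# that is non-linear in the original (x1, x2) space.
print(expand_features(2.0, 3.0))
```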

Page 8

Logistic Regression
• Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$
  where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$, with
  $x^T = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$ and $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$
• Model: $h_\theta(x) = g(\theta^T x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$

Page 9

Logistic Regression Objective Function
• Can't just use squared loss as in linear regression:
  $J(\theta) = \dfrac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$
  – Using the logistic regression model $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$ results in a non-convex optimization

Page 10

Deriving the Cost Function via Maximum Likelihood Estimation
• Likelihood of the data is given by:
  $l(\theta) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
• So, we are looking for the $\theta$ that maximizes the likelihood:
  $\theta_{\text{MLE}} = \arg\max_\theta l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
• Can take the log without changing the solution:
  $\theta_{\text{MLE}} = \arg\max_\theta \log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

Page 11

Deriving the Cost Function via Maximum Likelihood Estimation
• Expand as follows:
  $\theta_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
  $\qquad\;\; = \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\left(1 - p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right)\right) \right]$
• Substitute in the model, and take the negative to yield the logistic regression objective: $\min_\theta J(\theta)$, where
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
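A direct NumPy translation of this objective might look like the sketch below; the small `eps` guard against log(0) is an implementation detail added here, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    eps = 1e-12                       # guards against log(0); not from the slides
    return -np.sum(y * np.log(h + eps) + (1.0 - y) * np.log(1.0 - h + eps))
```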

Page 12

Intuition Behind the Objective
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
• Cost of a single instance:
  $\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Can re-write the objective function as
  $J(\theta) = \sum_{i=1}^{n} \text{cost}\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)$
• Compare to linear regression:
  $J(\theta) = \dfrac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

Page 13

Intuition Behind the Objective
  $\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Aside: Recall the plot of $\log(z)$

Page 14

Intuition Behind the Objective
  $\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If $y = 1$:
• Cost = 0 if the prediction is correct
• As $h_\theta(x) \rightarrow 0$, $\text{cost} \rightarrow \infty$
• Captures the intuition that larger mistakes should get larger penalties
  – e.g., predict $h_\theta(x) = 0$, but $y = 1$

[Plot: cost as a function of $h_\theta(x)$ on $[0, 1]$ when $y = 1$]

Based on example by Andrew Ng

Page 15

Intuition Behind the Objective
  $\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If $y = 0$:
• Cost = 0 if the prediction is correct
• As $(1 - h_\theta(x)) \rightarrow 0$, $\text{cost} \rightarrow \infty$
• Captures the intuition that larger mistakes should get larger penalties

[Plot: cost as a function of $h_\theta(x)$ on $[0, 1]$ for $y = 1$ and $y = 0$]

Based on example by Andrew Ng

Page 16

Regularized Logistic Regression
• We can regularize logistic regression exactly as before:
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
  $J_{\text{regularized}}(\theta) = J(\theta) + \dfrac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
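A minimal sketch of the regularized objective, assuming `X` includes a bias column and `lam` is the regularization strength λ (names are illustrative):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J_reg(theta) = J(theta) + (lam / 2) * ||theta[1:d]||_2^2; theta_0 is not regularized."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    eps = 1e-12                       # numerical guard, not part of the slide formula
    J = -np.sum(y * np.log(h + eps) + (1.0 - y) * np.log(1.0 - h + eps))
    return J + (lam / 2.0) * np.sum(theta[1:] ** 2)
```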

Page 17

Gradient Descent for Logistic Regression
Want: $\min_\theta J(\theta)$, with
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_j \leftarrow \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$
Use the natural logarithm ($\ln = \log_e$) to cancel with the $\exp()$ in $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$

Page 18

Gradient Descent for Logistic Regression
Want: $\min_\theta J(\theta)$, with
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
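A vectorized sketch of these updates, with all parameters updated simultaneously; the function name and the fixed iteration count are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, lam=0.0, num_iters=1000):
    """Batch gradient descent following the updates above; theta_0 is not regularized."""
    n, d_plus_1 = X.shape                 # X is assumed to include the bias column of 1s
    theta = np.zeros(d_plus_1)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        error = h - y                     # (h_theta(x^(i)) - y^(i)) for all i
        grad = X.T @ error                # sum_i (h - y) * x_j^(i) for every j
        grad[1:] += lam * theta[1:]       # add lambda * theta_j for j >= 1
        theta = theta - alpha * grad      # simultaneous update of all theta_j
    return theta
```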

Page 19

Gradient Descent for Logistic Regression
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$

This looks IDENTICAL to linear regression!!!
• Ignoring the $1/n$ constant
• However, the form of the model is very different:
  $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$

Page 20

Stochastic Gradient Descent

Page 21

Consider Learning with Numerous Data
• Logistic regression objective:
  $J(\theta) = -\dfrac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$
• Fit via gradient descent:
  $\theta_j \leftarrow \theta_j - \alpha \dfrac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$
  (each summand is the per-instance gradient $\dfrac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$)
• What is the computational complexity in terms of $n$? Each update sums over all $n$ instances, i.e., it costs $O(n)$ per step.

Page 22

Gradient Descent

Batch Gradient Descent:
  Initialize $\theta$
  Repeat {
    $\theta_j \leftarrow \theta_j - \alpha \dfrac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$   for $j = 0 \ldots d$
  }
  (the update applies $\dfrac{\partial}{\partial \theta_j} J(\theta)$)

Stochastic Gradient Descent:
  Initialize $\theta$
  Randomly shuffle the dataset
  Repeat {   (typically 1–10x)
    For $i = 1 \ldots n$, do
      $\theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x_i) - y_i \right) x_{ij}$   for $j = 0 \ldots d$
  }
  (the update applies $\dfrac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$)
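One minimal Python sketch of the stochastic variant above; the epoch count and random seed are arbitrary choices here, not from the slides.

```python
import numpy as np

def sgd(X, y, alpha=0.01, num_epochs=5, seed=0):
    """Stochastic gradient descent: shuffle the data once, then update on one instance at a time."""
    rng = np.random.default_rng(seed)
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    order = rng.permutation(n)            # randomly shuffle the dataset
    for _ in range(num_epochs):           # typically 1-10 passes over the data
        for i in order:
            h_i = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))
            theta = theta - alpha * (h_i - y[i]) * X[i]   # per-instance update for all j
    return theta
```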

Page 23

Batch vs. Stochastic GD

[Plots comparing convergence of Batch GD and Stochastic GD]

• The learning rate $\alpha$ is typically held constant
• Can slowly decrease $\alpha$ over time to force $\theta$ to converge, e.g.,
  $\alpha_t = \dfrac{\text{constant}_1}{\text{iterationNumber} + \text{constant}_2}$

Based on slide by Andrew Ng
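One way this decay schedule could be expressed in code; the constants below are placeholders, not values from the slide.

```python
# Hypothetical decay schedule for the SGD learning rate.
constant1, constant2 = 1.0, 50.0

def alpha_at(iteration_number):
    return constant1 / (iteration_number + constant2)
```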

Page 24

Adagrad

Page 25

New Stochastic Gradient Algorithms
• So far, we have considered:
  • a constant learning rate $\alpha$
  • a time-dependent learning rate $\alpha_t$ via a pre-set formula
• AdaGrad adjusts the learning rate based on historical information
  • Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones
  • Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features
• Define a per-feature learning rate for feature $j$ as:
  $\alpha_{t,j} = \dfrac{\alpha}{\sqrt{G_{t,j}}}$, where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$ and $g_{k,j} = \dfrac{\partial}{\partial \theta_j} \text{cost}_\theta(x_k, y_k)$
• $G_{t,j}$ is the sum of squares of the gradients of feature $j$ through time $t$

Page 26

New Stochastic Gradient Algorithms

[Plots of $\|\theta - \theta^*\|_2$ vs. time: with a bad choice for $\alpha$, and with a good choice for $\alpha$]

• Adagrad changes the update rule for SGD at time $t$ from
  $\theta_j \leftarrow \theta_j - \alpha \, g_{t,j}$
  to
  $\theta_j \leftarrow \theta_j - \dfrac{\alpha}{\sqrt{G_{t,j} + \zeta}} \, g_{t,j}$
  using the Adagrad per-feature learning rate $\alpha_{t,j} = \dfrac{\alpha}{\sqrt{G_{t,j}}}$, where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$
• In practice, we add a small constant $\zeta > 0$ to prevent divide-by-zero errors
• Adagrad converges quickly

Plots from http://akyrillidis.github.io/notes/AdaGrad
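A sketch of this per-feature update grafted onto the earlier SGD loop; the default values of `alpha` and `zeta` are arbitrary here.

```python
import numpy as np

def adagrad_sgd(X, y, alpha=0.1, zeta=1e-8, num_epochs=5):
    """SGD with AdaGrad per-feature learning rates:
    theta_j <- theta_j - alpha / sqrt(G_{t,j} + zeta) * g_{t,j}."""
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    G = np.zeros(d_plus_1)                # running sum of squared gradients, one entry per feature
    for _ in range(num_epochs):
        for i in range(n):
            h_i = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))
            g = (h_i - y[i]) * X[i]       # per-instance gradient g_t
            G += g ** 2                   # accumulate squared gradients per feature
            theta = theta - alpha / np.sqrt(G + zeta) * g
    return theta
```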

Page 27

Multi-Class Classification

Page 28

Multi-Class Classification
Disease diagnosis: healthy / cold / flu / pneumonia
Object classification: desk / chair / monitor / bookcase

[Scatter plots over $x_1$, $x_2$: binary classification vs. multi-class classification]

Page 29

Multi-Class Logistic Regression
• For 2 classes:
  $h_\theta(x) = \dfrac{1}{1 + \exp(-\theta^T x)} = \dfrac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}$
  (in the second form, the 1 in the denominator is the weight assigned to $y = 0$, and $\exp(\theta^T x)$ is the weight assigned to $y = 1$)
• For $C$ classes $\{1, \ldots, C\}$:
  $p(y = c \mid x; \theta_1, \ldots, \theta_C) = \dfrac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$
  – Called the softmax function
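A small sketch of the softmax; the max-subtraction step is a standard numerical-stability trick added here, not part of the slide, and the example parameters are arbitrary.

```python
import numpy as np

def softmax_proba(Theta, x):
    """p(y = c | x) = exp(theta_c^T x) / sum_{c'} exp(theta_{c'}^T x).
    Theta is a (C x (d+1)) matrix whose rows are the per-class parameter vectors."""
    scores = Theta @ x                    # one score theta_c^T x per class
    scores = scores - scores.max()        # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

Theta = np.array([[0.1, -0.2], [0.3, 0.5], [-0.4, 0.0]])   # 3 classes, one feature plus bias
x = np.array([1.0, 2.0])
print(softmax_proba(Theta, x))            # three nonnegative values that sum to 1
```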

Page 30

Multi-Class Logistic Regression

Split into One vs. Rest:
[Scatter plot over $x_1$, $x_2$ split into one-vs-rest subproblems]

• Train a logistic regression classifier for each class $i$ to predict the probability that $y = i$, with
  $h_c(x) = \dfrac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$

Page 31

Implementing Multi-Class Logistic Regression
• Use $h_c(x) = \dfrac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$ as the model for class $c$
• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above $h_c(x)$
• Predict the class label as the most probable label: $\hat{y} = \arg\max_c h_c(x)$
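A minimal prediction sketch to close the loop, reusing the hypothetical `Theta` parameter matrix from the softmax example above:

```python
import numpy as np

def predict_class(Theta, x):
    """Predict the most probable class label, argmax_c h_c(x).
    Since softmax is monotonic in the scores, taking argmax over theta_c^T x is sufficient."""
    return int(np.argmax(Theta @ x))
```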

