Machine Learning for Intelligent Systems - cs.cornell.edu · Lecture 24: Boosting Reading: UML...
Machine Learning for Intelligent Systems
Instructors: Nika Haghtalab (this time) and Thorsten Joachims
Lecture 24: Boosting
Reading: UML 10-10.3. Optional readings: Schapire's survey and tutorial.
Fundamental Question
I want a learning algorithm that for any distribution P learns an excellent classifier h_strong such that err_P(h_strong) ≤ 0.01.
I'm given a learning algorithm A that for any distribution D returns a not-too-terrible classifier h_weak such that err_D(h_weak) ≤ 0.49.
Can I use this algorithm A to find h_strong with err_P(h_strong) ≤ 0.01?
Strong versus Weak Learning

Strong Learner: a learning algorithm for PAC learning. For every distribution P and every ε, a strong learner returns, with probability 1 − δ, a classifier h such that err_P(h) ≤ ε.

Error of random guessing: for any distribution P, ignore P and for each x predict +1 or −1 with probability 50-50. What's the error? Exactly 0.5.

Weak Learner: better than random guessing. For every distribution P and some γ > 0, a weak learner returns, with probability 1 − δ, a classifier h such that err_P(h) ≤ 1/2 − γ.
Boosting
Is there a boosting algorithm that turns a weak learner into a strong learner? [Michael Kearns, Leslie Valiant]
Yes! There is a boosting algorithm that uses a weak learner on an adaptively designed polynomial-size sequence of distributions and strongly learns. Weak Learning = Strong Learning. [Robert Schapire, Yoav Freund]
Warmup
Suppose our weak learner knows when it doesn't know!
• h: x → {+1, −1, "Not sure"}.
• On at most a 1 − ε′ fraction of the data, it can say "Not sure".
• On the fraction of the data where it is sure, it makes ε error.
• This leads to a weak learner, if on "Not sure" we randomly guess:
  err_P(h) ≤ (1/2)(1 − ε′) + ε·ε′ ≤ 1/2 − γ, for γ = ε′(1/2 − ε).
Boosting:
• Start with a weak learner.
• Boost by focusing the distribution on instances the previous learner wasn't sure about.
Warmup Analysis
Boost by a decision list:
• Train h_t on P_t. Let P_{t+1} ← P_t | {x : h_t(x) = "Not sure"}.
• Repeat until the total probability of the "Not sure" region is ε.
• Total error at most 2ε.
• It only takes T = (1/ε′) ln(1/ε) rounds: (1 − ε′)^T ≤ exp(−ε′·T) ≤ ε.
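The round bound above can be sanity-checked numerically; a minimal sketch, with hypothetical values for ε′ and ε:

```python
import math

# Hypothetical values: the learner is sure on an eps_prime fraction of the
# remaining data each round; we want the "Not sure" mass down to eps.
eps_prime, eps = 0.1, 0.01

# Rounds suggested by the bound T = (1/eps') * ln(1/eps).
T = math.ceil((1 / eps_prime) * math.log(1 / eps))

# The "Not sure" mass shrinks geometrically:
# (1 - eps')^T <= exp(-eps' * T) <= eps.
not_sure_mass = (1 - eps_prime) ** T
assert not_sure_mass <= math.exp(-eps_prime * T) <= eps
print(T, not_sure_mass)   # 47 rounds suffice here
```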
[Figure: a decision list. Each sample passes through h_1(x), h_2(x), h_3(x), …, falling through on "Not sure" to the next classifier, and ending in a random guess.]
The "Not sure" region has probability ≤ ε, and the error on the samples it is sure about is ≤ ε.
Added after class: reason for the above. Conditioned on being sure, we are wrong with probability ≤ ε; so the total probability of error on the sure region, Pr[h_t(x) is sure] · Pr[h_t(x) is wrong | h_t(x) is sure], is ≤ ε. Another way to see this is that the total probability of error across rounds is ∑_{t=1}^{T} ε · ε′(1 − ε′)^{t−1} ≤ ε.
A Recipe for Boosting
Input: (x_1, y_1), …, (x_m, y_m) and a weak learning algorithm.
Let P_1(x_i) = 1/m for all i, i.e., the uniform distribution over samples.
For t = 1, …, T:
• Learn a weak classifier h_t ∈ H on distribution P_t.
• Construct P_{t+1} that has higher weight compared to P_t on instances where h_1, …, h_t didn't perform well.
Output the final hypothesis
h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)).
Boosting Recipe: it remains to specify these weights α_t and the distributions P_{t+1}.
Constructing P_{t+1}
Increase the weight of x_i if h_t made a mistake on it. Decrease the weight if h_t was correct.
• Don't want to cut the weight to 0:
  → h_{t+1} could be arbitrarily bad on the region where h_t was good.
  → The majority vote could be bad.
• Change the weights so that h_t would have error exactly 0.5 on P_{t+1}.
[Figure: the mass of P_t split into "h_t right" and "h_t wrong" bars; change the weights without normalizing, then normalize to get P_{t+1}, using the error of h_t on P_t.]
Constructing P_{t+1}
Let ε_t = Pr_{x_i ∼ P_t}[h_t(x_i) ≠ y_i] and let α_t = (1/2) ln((1 − ε_t)/ε_t). Let
P_{t+1}(x_i) = P_t(x_i) exp(−α_t y_i h_t(x_i)) / Z_t,
where Z_t = ∑_i P_t(x_i) exp(−α_t y_i h_t(x_i)) is the normalization factor.
Constructing the next distribution

P_{t+1}(x_i) = (P_t(x_i)/Z_t) · exp(−α_t)  if y_i = h_t(x_i)
P_{t+1}(x_i) = (P_t(x_i)/Z_t) · exp(+α_t)  if y_i ≠ h_t(x_i)

Here ε_t is the weight of P_t on the incorrect points, and 1 − ε_t its weight on the correct points.

Weight on h_t(x_i) ≠ y_i:
(1/Z_t) · ε_t · exp((1/2) ln((1 − ε_t)/ε_t)) = (1/Z_t) · ε_t · ((1 − ε_t)/ε_t)^{1/2} = √(ε_t(1 − ε_t)) / Z_t.

Weight on h_t(x_i) = y_i:
(1/Z_t) · (1 − ε_t) · exp(−(1/2) ln((1 − ε_t)/ε_t)) = (1/Z_t) · (1 − ε_t) · (ε_t/(1 − ε_t))^{1/2} = √(ε_t(1 − ε_t)) / Z_t.

The two weights are equal, so h_t has error exactly 1/2 on P_{t+1}, as desired.
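The calculation above can be checked numerically: after one reweighting step, the previous classifier h_t has weighted error exactly 1/2 under P_{t+1}. A minimal sketch with hypothetical numbers:

```python
import math

# Hypothetical toy setup: a distribution P_t over 4 points, and whether
# h_t classified each point correctly.
P = [0.1, 0.2, 0.3, 0.4]             # current distribution P_t
correct = [True, True, False, True]  # h_t(x_i) == y_i ?

eps_t = sum(p for p, c in zip(P, correct) if not c)   # P_t-weighted error
alpha_t = 0.5 * math.log((1 - eps_t) / eps_t)

# Unnormalized update: shrink correct points, grow incorrect ones.
w = [p * math.exp(-alpha_t if c else alpha_t) for p, c in zip(P, correct)]
Z = sum(w)                            # normalization factor Z_t
P_next = [wi / Z for wi in w]         # P_{t+1}

err_next = sum(p for p, c in zip(P_next, correct) if not c)
assert abs(err_next - 0.5) < 1e-12    # h_t's error on P_{t+1} is exactly 1/2
```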
Adaptive Boosting (AdaBoost) Algorithm
Input: (x_1, y_1), …, (x_m, y_m) and a weak learning algorithm.
Let P_1(x_i) = 1/m for all i, i.e., the uniform distribution over samples.
For t = 1, …, T:
• Learn a weak classifier h_t ∈ H on distribution P_t.
• Let ε_t = Pr_{x_i ∼ P_t}[h_t(x_i) ≠ y_i] and let α_t = (1/2) ln((1 − ε_t)/ε_t).
• P_{t+1}(x_i) = P_t(x_i) exp(−α_t y_i h_t(x_i)) / Z_t.
Output the final hypothesis
h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)).
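The algorithm above can be sketched in a few lines of Python. This is a minimal illustration using an exhaustive decision-stump weak learner (axis-aligned half-spaces, as in the next slide's example); all names and the toy data are ours, not from the lecture:

```python
import math

def adaboost(X, y, T):
    """Minimal AdaBoost sketch; weak learner = best decision stump."""
    m = len(X)
    P = [1.0 / m] * m                       # P_1: uniform over the sample
    ensemble = []                           # list of (alpha_t, stump_t)
    for _ in range(T):
        # Weak learner: pick the stump with lowest P-weighted error.
        best_err, best = None, None
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for sgn in (+1, -1):
                    pred = [sgn if x[feat] > thresh else -sgn for x in X]
                    err = sum(p for p, pr, yi in zip(P, pred, y) if pr != yi)
                    if best_err is None or err < best_err:
                        best_err, best = err, (feat, thresh, sgn)
        eps = min(max(best_err, 1e-12), 1 - 1e-12)   # guard the log
        alpha = 0.5 * math.log((1 - eps) / eps)
        feat, thresh, sgn = best
        pred = [sgn if x[feat] > thresh else -sgn for x in X]
        # P_{t+1}(x_i) ∝ P_t(x_i) * exp(-alpha_t * y_i * h_t(x_i))
        P = [p * math.exp(-alpha * yi * pr) for p, yi, pr in zip(P, y, pred)]
        Z = sum(P)
        P = [p / Z for p in P]
        ensemble.append((alpha, best))

    def h_final(x):
        s = sum(a * (sg if x[f] > t else -sg) for a, (f, t, sg) in ensemble)
        return +1 if s >= 0 else -1
    return h_final

# Toy 1D sample that no single stump classifies perfectly (+ + - - + +).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [+1, +1, -1, -1, +1, +1]
h = adaboost(X, y, T=3)
print([h(x) for x in X])   # → [1, 1, -1, -1, 1, 1]: zero training error
```

Three rounds suffice here because the weighted vote of three stumps can carve out the middle interval that no single half-space can.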
Example
Assume that the weak learner returns vertical or horizontal half-spaces (that's the H). Example from Schapire's NeurIPS '03 tutorial.
[Figure: Rounds 1, 2, and 3 of AdaBoost on a 2D toy sample, with misclassified points upweighted after each round.]
The combined classifier:
h_final = sign(0.42 h_1 + 0.65 h_2 + 0.92 h_3).
Bounding the Sample Error

Theorem (AdaBoost's training error): Let γ_t = 1/2 − ε_t. For any T, h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)) has training error
err_S(h_final) ≤ exp(−2 ∑_{t=1}^{T} γ_t²).
So, for weak learners where γ_t > γ, and T = O((1/γ²) ln(1/ε)), we have err_S(h_final) ≤ ε.

Ada(ptive)Boost:
• Adaptive: we don't need to know γ or T before we start.
• Can adapt to γ_t.
• Automatically better when γ_t ≫ γ.
• Practical algorithm.
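Plugging hypothetical per-round edges γ_t into the theorem gives a quick numeric feel for the bound:

```python
import math

# Hypothetical edges gamma_t = 1/2 - eps_t for three rounds.
gammas = [1/6, 1/4, 1/3]
bound = math.exp(-2 * sum(g * g for g in gammas))   # err_S(h_final) bound
print(round(bound, 2))   # → 0.67

# With a uniform edge gamma, T = ln(1/eps) / (2*gamma^2) rounds push the
# bound below eps, matching T = O((1/gamma^2) ln(1/eps)) from the slide.
gamma, eps = 0.1, 0.01
T = math.ceil(math.log(1 / eps) / (2 * gamma ** 2))
assert math.exp(-2 * T * gamma ** 2) <= eps
```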
Generalization Error
We gave a guarantee that the sample error is at most err_S(h_final) ≤ ε. What about generalization?
• h_final is a combination of T hypotheses h_1, …, h_T ∈ H.
• Possibly h_final ∉ H, but it's still structured.
• Recall from Homework 3:
  → A combination of T hypotheses from H has a bounded growth function.
  → Roughly speaking: this means h_final comes from a class with VC dimension Õ(T · VCDim(H)).

Theorem (AdaBoost's true error): when S has Ω(VCDim(H)/(γ²ε)) many samples, then err_P(h_final) ≤ ε.
Better Generalization Guarantee
Last slide: VC dimension Õ(T · VCDim(H)) → keep T small. As T increases, there is a chance of overfitting.
[Figure: our first guess — true error rising with model complexity while training error falls — versus an actual run of AdaBoost, where the true error keeps decreasing.]
There is cool theory for proving why AdaBoost doesn't overfit.
Boosting & Regret Minimization
Schapire and Freund also gave online learning algorithms (last lecture). There is a connection between boosting and regret minimization. [Robert Schapire, Yoav Freund]

Optional Material
Consider the payoff matrix M with rows h_1, h_2, …, h_|H| and columns x_1, x_2, …, x_m, where M_ij = ±1 depending on correctness.
• For every distribution P over the columns, there is a row with expected payoff ≥ 1/2 + γ.
⇒ Boosting: a distribution Q over h_1, h_2, … that achieves ≥ 1/2 + γ for every x_i.
⇒ Regret minimization against an adversary who is best responding results in the sequence h_1, h_2, …
Ensemble Methods
Meta-learning algorithms that call multiple algorithms to improve learning performance.
h_ensemble(x) = sign(∑_{t=1}^{T} α_t h_t(x))
Boosting: take one sample set S, learn h_t for different weights on these samples, and take the α_t-weighted majority vote.
→ Improves the training error of the weak classifiers h_t.
Bagging (Bootstrap Aggregating)
Even if the training error is already good (bias), can we decrease the variance?
h_bagging(x) = sign(∑_{t=1}^{T} h_t(x)), i.e., α_t = 1, with each h_t trained on subsamples.

Input: S = {(x_1, y_1), …, (x_m, y_m)} and any learning algorithm.
For t = 1, …, T:
• S_t = sample with replacement from S.
• h_t = train on the sample set S_t.
Return sign(∑_{t=1}^{T} h_t(x)).
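The pseudocode above can be sketched as follows; the base learner here (a constant majority-label classifier) is just a hypothetical stand-in for "any learning algorithm":

```python
import random

def bagging(S, learner, T, seed=0):
    """Bootstrap aggregating as on the slide: train on T resamples of S
    (with replacement) and return the unweighted majority vote."""
    rng = random.Random(seed)
    hs = []
    for _ in range(T):
        S_t = [rng.choice(S) for _ in range(len(S))]  # sample with replacement
        hs.append(learner(S_t))
    return lambda x: 1 if sum(h(x) for h in hs) >= 0 else -1

# Hypothetical base learner: predict the majority label of its sample.
def majority_label(S_t):
    c = 1 if sum(y for _, y in S_t) >= 0 else -1
    return lambda x: c

S = [((0.0,), 1), ((1.0,), 1), ((2.0,), -1), ((3.0,), 1), ((4.0,), 1)]
h = bagging(S, majority_label, T=11)   # odd T avoids tied votes
print(h((9.0,)))
```

Each bootstrap sample S_t typically omits about 1/e of the points, so the T learners differ, and averaging them reduces variance without changing the bias of the base learner.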
Happy Thanksgiving!