Machine Learning for Intelligent Systems - cs.cornell.edu · Lecture 24: Boosting Reading: UML...
Machine Learning for Intelligent Systems
Instructors: Nika Haghtalab (this time) and Thorsten Joachims
Lecture 24: Boosting
Reading: UML 10-10.3. Optional readings: Schapire's survey and tutorial.
Fundamental Question
I want a learning algorithm that for any distribution P learns an excellent classifier h_strong such that err_P(h_strong) ≤ 0.01.
I'm given a learning algorithm A that for any distribution D returns a not-too-terrible classifier h_weak such that err_D(h_weak) ≤ 0.49.
Can I use this algorithm A to find h_strong with err_P(h_strong) ≤ 0.01?
Strong versus Weak Learning

Strong Learner: a learning algorithm for PAC learning. For every distribution P and every ε, a strong learner returns, with probability 1 − δ, a classifier h such that err_P(h) ≤ ε.

Error of random guessing: for any distribution P, ignore P and for each x predict +1 or −1 with probability 50-50. What's the error? Exactly 0.5.

Weak Learner: better than random guessing. For every distribution P and some γ > 0, a weak learner returns, with probability 1 − δ, a classifier h such that err_P(h) ≤ 1/2 − γ.
Boosting
Is there a boosting algorithm that turns a weak learner into a strong learner? [Michael Kearns, Leslie Valiant]
Yes! There is a boosting algorithm that uses a weak learner on an adaptively designed polynomial-size sequence of distributions and strongly learns. Weak Learning = Strong Learning. [Robert Schapire, Yoav Freund]
Warmup
Suppose our weak learner knows when it doesn't know!
• h: x → {+1, −1, "Not sure"}.
• On at most a 1 − ε′ fraction of the data, it can say "Not sure".
• On the fraction of the data where it is sure, it makes ε error.
• This leads to a weak learner, if on "Not sure" we randomly guess:
  err_P(h) ≤ (1/2)(1 − ε′) + ε·ε′ ≤ 1/2 − γ, for γ = ε′(1/2 − ε).
Boosting:
• Start with a weak learner.
• Boost by focusing the distribution on instances the previous learner wasn't sure about.
Warmup Analysis
Boost by a decision list:
• Train h_t on P_t. Let P_{t+1} ← P_t | {x : h_t(x) = "Not sure"}.
• Repeat until the total probability of the "Not sure" region is ε.
• Total error at most 2ε.
• It only takes T = (1/ε′) ln(1/ε) rounds: (1 − ε′)^T ≤ exp(−ε′·T) ≤ ε.
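The round bound above can be sanity-checked numerically; a minimal sketch, with hypothetical values for ε′ and ε:

```python
import math

# Hypothetical values: the learner is sure on an eps_prime fraction of the
# remaining data each round; we want the "Not sure" mass down to eps.
eps_prime, eps = 0.1, 0.01

# Rounds suggested by the bound T = (1/eps') * ln(1/eps).
T = math.ceil((1 / eps_prime) * math.log(1 / eps))

# The "Not sure" mass shrinks geometrically:
# (1 - eps')^T <= exp(-eps' * T) <= eps.
not_sure_mass = (1 - eps_prime) ** T
assert not_sure_mass <= math.exp(-eps_prime * T) <= eps
print(T, not_sure_mass)   # 47 rounds suffice here
```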
[Figure: a decision list. Each sample passes through h_1(x), h_2(x), h_3(x), …, falling through on "Not sure" to the next classifier, and ending in a random guess.]
The "Not sure" region has probability ≤ ε, and the error on the samples it is sure about is ≤ ε.
Added after class: reason for the above. Conditioned on being sure, we are wrong with probability ≤ ε; so the total probability of error on the sure region, Pr[h_t(x) is sure] · Pr[h_t(x) is wrong | h_t(x) is sure], is ≤ ε. Another way to see this is that the total probability of error across rounds is ∑_{t=1}^{T} ε · ε′(1 − ε′)^{t−1} ≤ ε.
A Recipe for Boosting
Input: (x_1, y_1), …, (x_m, y_m) and a weak learning algorithm.
Let P_1(x_i) = 1/m for all i, i.e., the uniform distribution over samples.
For t = 1, …, T:
• Learn a weak classifier h_t ∈ H on distribution P_t.
• Construct P_{t+1} that has higher weight compared to P_t on instances where h_1, …, h_t didn't perform well.
Output the final hypothesis
h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)).
Boosting Recipe: it remains to specify these weights α_t and the distributions P_{t+1}.
Constructing P_{t+1}
Increase the weight of x_i if h_t made a mistake on it. Decrease the weight if h_t was correct.
• Don't want to cut the weight to 0:
  → h_{t+1} could be arbitrarily bad on the region where h_t was good.
  → The majority vote could be bad.
• Change the weights so that h_t would have error exactly 0.5 on P_{t+1}.
[Figure: the mass of P_t split into "h_t right" and "h_t wrong" bars; change the weights without normalizing, then normalize to get P_{t+1}, using the error of h_t on P_t.]
Constructing P_{t+1}
Let ε_t = Pr_{x_i ∼ P_t}[h_t(x_i) ≠ y_i] and let α_t = (1/2) ln((1 − ε_t)/ε_t). Let
P_{t+1}(x_i) = P_t(x_i) exp(−α_t y_i h_t(x_i)) / Z_t,
where Z_t = ∑_i P_t(x_i) exp(−α_t y_i h_t(x_i)) is the normalization factor.
Constructing the next distribution

P_{t+1}(x_i) = (P_t(x_i)/Z_t) · exp(−α_t)  if y_i = h_t(x_i)
P_{t+1}(x_i) = (P_t(x_i)/Z_t) · exp(+α_t)  if y_i ≠ h_t(x_i)

Here ε_t is the weight of P_t on the incorrect points, and 1 − ε_t its weight on the correct points.

Weight on h_t(x_i) ≠ y_i:
(1/Z_t) · ε_t · exp((1/2) ln((1 − ε_t)/ε_t)) = (1/Z_t) · ε_t · ((1 − ε_t)/ε_t)^{1/2} = √(ε_t(1 − ε_t)) / Z_t.

Weight on h_t(x_i) = y_i:
(1/Z_t) · (1 − ε_t) · exp(−(1/2) ln((1 − ε_t)/ε_t)) = (1/Z_t) · (1 − ε_t) · (ε_t/(1 − ε_t))^{1/2} = √(ε_t(1 − ε_t)) / Z_t.

The two weights are equal, so h_t has error exactly 1/2 on P_{t+1}, as desired.
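The calculation above can be checked numerically: after one reweighting step, the previous classifier h_t has weighted error exactly 1/2 under P_{t+1}. A minimal sketch with hypothetical numbers:

```python
import math

# Hypothetical toy setup: a distribution P_t over 4 points, and whether
# h_t classified each point correctly.
P = [0.1, 0.2, 0.3, 0.4]             # current distribution P_t
correct = [True, True, False, True]  # h_t(x_i) == y_i ?

eps_t = sum(p for p, c in zip(P, correct) if not c)   # P_t-weighted error
alpha_t = 0.5 * math.log((1 - eps_t) / eps_t)

# Unnormalized update: shrink correct points, grow incorrect ones.
w = [p * math.exp(-alpha_t if c else alpha_t) for p, c in zip(P, correct)]
Z = sum(w)                            # normalization factor Z_t
P_next = [wi / Z for wi in w]         # P_{t+1}

err_next = sum(p for p, c in zip(P_next, correct) if not c)
assert abs(err_next - 0.5) < 1e-12    # h_t's error on P_{t+1} is exactly 1/2
```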
Adaptive Boosting (AdaBoost) Algorithm
Input: (x_1, y_1), …, (x_m, y_m) and a weak learning algorithm.
Let P_1(x_i) = 1/m for all i, i.e., the uniform distribution over samples.
For t = 1, …, T:
• Learn a weak classifier h_t ∈ H on distribution P_t.
• Let ε_t = Pr_{x_i ∼ P_t}[h_t(x_i) ≠ y_i] and let α_t = (1/2) ln((1 − ε_t)/ε_t).
• P_{t+1}(x_i) = P_t(x_i) exp(−α_t y_i h_t(x_i)) / Z_t.
Output the final hypothesis
h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)).
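The algorithm above can be sketched in a few lines of Python. This is a minimal illustration using an exhaustive decision-stump weak learner (axis-aligned half-spaces, as in the next slide's example); all names and the toy data are ours, not from the lecture:

```python
import math

def adaboost(X, y, T):
    """Minimal AdaBoost sketch; weak learner = best decision stump."""
    m = len(X)
    P = [1.0 / m] * m                       # P_1: uniform over the sample
    ensemble = []                           # list of (alpha_t, stump_t)
    for _ in range(T):
        # Weak learner: pick the stump with lowest P-weighted error.
        best_err, best = None, None
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for sgn in (+1, -1):
                    pred = [sgn if x[feat] > thresh else -sgn for x in X]
                    err = sum(p for p, pr, yi in zip(P, pred, y) if pr != yi)
                    if best_err is None or err < best_err:
                        best_err, best = err, (feat, thresh, sgn)
        eps = min(max(best_err, 1e-12), 1 - 1e-12)   # guard the log
        alpha = 0.5 * math.log((1 - eps) / eps)
        feat, thresh, sgn = best
        pred = [sgn if x[feat] > thresh else -sgn for x in X]
        # P_{t+1}(x_i) ∝ P_t(x_i) * exp(-alpha_t * y_i * h_t(x_i))
        P = [p * math.exp(-alpha * yi * pr) for p, yi, pr in zip(P, y, pred)]
        Z = sum(P)
        P = [p / Z for p in P]
        ensemble.append((alpha, best))

    def h_final(x):
        s = sum(a * (sg if x[f] > t else -sg) for a, (f, t, sg) in ensemble)
        return +1 if s >= 0 else -1
    return h_final

# Toy 1D sample that no single stump classifies perfectly (+ + - - + +).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [+1, +1, -1, -1, +1, +1]
h = adaboost(X, y, T=3)
print([h(x) for x in X])   # → [1, 1, -1, -1, 1, 1]: zero training error
```

Three rounds suffice here because the weighted vote of three stumps can carve out the middle interval that no single half-space can.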
Example
Assume that the weak learner returns vertical or horizontal half-spaces (that's the H). Example from Schapire's NeurIPS '03 tutorial.
[Figure: Rounds 1, 2, and 3 of AdaBoost on a 2D toy sample, with misclassified points upweighted after each round.]
The combined classifier:
h_final = sign(0.42 h_1 + 0.65 h_2 + 0.92 h_3).
Bounding the Sample Error

Theorem (AdaBoost's training error): Let γ_t = 1/2 − ε_t. For any T, h_final(x) = sign(∑_{t=1}^{T} α_t h_t(x)) has training error
err_S(h_final) ≤ exp(−2 ∑_{t=1}^{T} γ_t²).
So, for weak learners where γ_t > γ, and T = O((1/γ²) ln(1/ε)), we have err_S(h_final) ≤ ε.

Ada(ptive)Boost:
• Adaptive: we don't need to know γ or T before we start.
• Can adapt to γ_t.
• Automatically better when γ_t ≫ γ.
• Practical algorithm.
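Plugging hypothetical per-round edges γ_t into the theorem gives a quick numeric feel for the bound:

```python
import math

# Hypothetical edges gamma_t = 1/2 - eps_t for three rounds.
gammas = [1/6, 1/4, 1/3]
bound = math.exp(-2 * sum(g * g for g in gammas))   # err_S(h_final) bound
print(round(bound, 2))   # → 0.67

# With a uniform edge gamma, T = ln(1/eps) / (2*gamma^2) rounds push the
# bound below eps, matching T = O((1/gamma^2) ln(1/eps)) from the slide.
gamma, eps = 0.1, 0.01
T = math.ceil(math.log(1 / eps) / (2 * gamma ** 2))
assert math.exp(-2 * T * gamma ** 2) <= eps
```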
Generalization Error
We gave a guarantee that the sample error is at most err_S(h_final) ≤ ε. What about generalization?
• h_final is a combination of T hypotheses h_1, …, h_T ∈ H.
• Possibly h_final ∉ H, but it's still structured.
• Recall from Homework 3:
  → A combination of T hypotheses from H has a bounded growth function.
  → Roughly speaking: this means h_final comes from a class with VC dimension Õ(T · VCDim(H)).

Theorem (AdaBoost's true error): when S has Ω(VCDim(H)/(γ²ε)) many samples, then err_P(h_final) ≤ ε.
Better Generalization Guarantee
Last slide: VC dimension Õ(T · VCDim(H)) → keep T small. As T increases, there is a chance of overfitting.
[Figure: our first guess — true error rising with model complexity while training error falls — versus an actual run of AdaBoost, where the true error keeps decreasing.]
There is cool theory for proving why AdaBoost doesn't overfit.
Boosting & Regret Minimization
Schapire and Freund also gave online learning algorithms (last lecture). There is a connection between boosting and regret minimization. [Robert Schapire, Yoav Freund]

Optional Material
Consider the payoff matrix M with rows h_1, h_2, …, h_|H| and columns x_1, x_2, …, x_m, where M_ij = ±1 depending on correctness.
• For every distribution P over the columns, there is a row with expected payoff ≥ 1/2 + γ.
⇒ Boosting: a distribution Q over h_1, h_2, … that achieves ≥ 1/2 + γ for every x_i.
⇒ Regret minimization against an adversary who is best responding results in the sequence h_1, h_2, …
Ensemble Methods
Meta-learning algorithms that call multiple algorithms to improve learning performance.
h_ensemble(x) = sign(∑_{t=1}^{T} α_t h_t(x))
Boosting: take one sample set S, learn h_t for different weights on these samples, and take the α_t-weighted majority vote.
→ Improves the training error of the weak classifiers h_t.
Bagging (Bootstrap Aggregating)
Even if the training error is already good (bias), can we decrease the variance?
h_bagging(x) = sign(∑_{t=1}^{T} h_t(x)), i.e., α_t = 1, with each h_t trained on subsamples.

Input: S = {(x_1, y_1), …, (x_m, y_m)} and any learning algorithm.
For t = 1, …, T:
• S_t = sample with replacement from S.
• h_t = train on the sample set S_t.
Return sign(∑_{t=1}^{T} h_t(x)).
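The pseudocode above can be sketched as follows; the base learner here (a constant majority-label classifier) is just a hypothetical stand-in for "any learning algorithm":

```python
import random

def bagging(S, learner, T, seed=0):
    """Bootstrap aggregating as on the slide: train on T resamples of S
    (with replacement) and return the unweighted majority vote."""
    rng = random.Random(seed)
    hs = []
    for _ in range(T):
        S_t = [rng.choice(S) for _ in range(len(S))]  # sample with replacement
        hs.append(learner(S_t))
    return lambda x: 1 if sum(h(x) for h in hs) >= 0 else -1

# Hypothetical base learner: predict the majority label of its sample.
def majority_label(S_t):
    c = 1 if sum(y for _, y in S_t) >= 0 else -1
    return lambda x: c

S = [((0.0,), 1), ((1.0,), 1), ((2.0,), -1), ((3.0,), 1), ((4.0,), 1)]
h = bagging(S, majority_label, T=11)   # odd T avoids tied votes
print(h((9.0,)))
```

Each bootstrap sample S_t typically omits about 1/e of the points, so the T learners differ, and averaging them reduces variance without changing the bias of the base learner.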
Happy Thanksgiving!