The Boosting Approach to
Machine Learning
Maria-Florina Balcan
10/01/2018
Boosting
• Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.
• Backed up by solid foundations.
• Works well in practice (Adaboost and its variants are among the top 10 algorithms).
• General method for improving the accuracy of any given learning algorithm.
Readings:
• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
• Theory and Applications of Boosting. NIPS tutorial.
http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf
Plan for today:
• Motivation.
• A bit of history.
• Adaboost: algo, guarantees, discussion.
• Focus on supervised classification.
An Example: Spam Detection
• E.g., classify which emails are spam and which are important.
Key observation/motivation:
• Easy to find rules of thumb that are often correct.
  - E.g., "If 'buy now' appears in the message, then predict 'spam'."
  - E.g., "If 'say good-bye to debt' appears in the message, then predict 'spam'."
• Harder to find a single rule that is very highly accurate.
• Boosting: meta-procedure that takes in an algo for finding rules of thumb (weak learner). Produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.
An Example: Spam Detection
• apply weak learner to a subset of emails, obtain rule of thumb
• apply to 2nd subset of emails, obtain 2nd rule of thumb
• apply to 3rd subset of emails, obtain 3rd rule of thumb
• repeat T times; combine weak rules into a single highly accurate rule.
[Figure: rules of thumb h_1, h_2, h_3, ..., each learned from a different subset of the emails, are combined into the final rule H.]
Boosting: Important Aspects
How to choose examples on each round?
• Typically, concentrate on the "hardest" examples (those most often misclassified by previous rules of thumb).
How to combine rules of thumb into a single prediction rule?
• Take a (weighted) majority vote of the rules of thumb.
Historically…
Weak Learning vs Strong/PAC Learning
• [Kearns & Valiant '88]: defined weak learning: being able to predict better than random guessing (error ≤ 1/2 - γ), consistently.
• Posed an open problem: "Does there exist a boosting algo that turns a weak learner into a strong PAC learner (that can produce arbitrarily accurate hypotheses)?"
• Informally, given a "weak" learning algo that can consistently find classifiers of error ≤ 1/2 - γ, a boosting algo would provably construct a single classifier with error ≤ ε.
Weak Learning vs Strong/PAC Learning
Strong (PAC) Learning:
• ∃ algo A
• ∀ c ∈ H
• ∀ D
• ∀ ε > 0
• ∀ δ > 0
• A produces h s.t. Pr[err(h) ≥ ε] ≤ δ

Weak Learning:
• ∃ algo A
• ∃ γ > 0
• ∀ c ∈ H
• ∀ D
• ∀ ε > 1/2 - γ
• ∀ δ > 0
• A produces h s.t. Pr[err(h) ≥ ε] ≤ δ

• [Kearns & Valiant '88]: defined weak learning & posed an open problem of finding a boosting algo.
Surprisingly…
Weak Learning = Strong (PAC) Learning
Original Construction [Schapire '89]:
• poly-time boosting algo, exploits that we can learn a little on every distribution.
• A modest booster obtained via calling the weak learning algorithm on 3 distributions:
  Error = β < 1/2 - γ  →  error 3β² - 2β³
• Then amplifies the modest boost of accuracy by running this somehow recursively.
• Cool conceptually and technically, not very practical.
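As a sanity check on the expression 3β² - 2β³ (an illustrative calculation; it assumes for simplicity that the three classifiers err independently, which Schapire's 3-distribution construction only engineers approximately): a majority vote of three classifiers, each wrong with probability β, is wrong exactly when at least two of them are wrong:

```latex
\Pr[\text{majority wrong}] = 3\beta^{2}(1-\beta) + \beta^{3} = 3\beta^{2} - 2\beta^{3} < \beta
\qquad \text{whenever } \beta < \tfrac{1}{2}.
```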
An explosion of subsequent work
Adaboost (Adaptive Boosting)
[Freund-Schapire, JCSS'97]
Gödel Prize winner 2003
"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Informal Description of Adaboost
• Boosting: turns a weak algo into a strong (PAC) learner.
Input: S = {(x_1, y_1), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {-1, 1};
weak learning algo A (e.g., Naïve Bayes, decision stumps)
• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}
  • Run A on D_t producing h_t : X → {-1, 1} (weak classifier)
    ϵ_t = Pr_{x_i ~ D_t}[h_t(x_i) ≠ y_i]  (error of h_t over D_t)
  Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i; decreases it on x_i if h_t is correct.
• Output: H_final(x) = sign(Σ_{t=1}^T α_t h_t(x))
[Figure: training points labeled + and -, with a weak classifier h_t splitting them.]
Adaboost (Adaptive Boosting)
• Weak learning algorithm A.
• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}
  • Run A on D_t producing h_t

Constructing D_t:
• D_1 uniform on {x_1, …, x_m}  [i.e., D_1(i) = 1/m]
• Given D_t and h_t, set
    α_t = (1/2) ln((1 - ϵ_t)/ϵ_t) > 0
    D_{t+1}(i) = (D_t(i)/Z_t) · e^{-α_t}  if y_i = h_t(x_i)
    D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t}   if y_i ≠ h_t(x_i)
  i.e., D_{t+1}(i) = (D_t(i)/Z_t) · e^{-α_t y_i h_t(x_i)}  (Z_t : normalization factor)
  D_{t+1} puts half of the weight on examples x_i where h_t is incorrect & half on examples where h_t is correct.

Final hyp: H_final(x) = sign(Σ_t α_t h_t(x))
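The following is a minimal runnable sketch of these update rules in Python/NumPy (not from the slides); `stump_learner` is one hypothetical choice of the weak learning algorithm A:

```python
import numpy as np

def stump_learner(X, y, D):
    """Hypothetical weak learner A: an axis-aligned decision stump chosen to
    minimize the weighted training error under the distribution D."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):                      # feature to split on
        for thr in np.unique(X[:, j]):               # candidate threshold
            for s in (+1, -1):                       # orientation of the stump
                pred = np.where(X[:, j] >= thr, s, -s)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    j, thr, s = best
    return lambda Xn: np.where(Xn[:, j] >= thr, s, -s)

def adaboost(X, y, T, weak_learner=stump_learner):
    """AdaBoost as on the slide above: X is (m, d), y holds labels in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                          # D_1: uniform on the sample
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)                    # run A on D_t, producing h_t
        pred = h(X)
        eps = D[pred != y].sum()                     # eps_t: weighted error of h_t
        if eps == 0.0 or eps >= 0.5:                 # outside the weak-learning regime; stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t > 0
        D = D * np.exp(-alpha * y * pred)            # D_{t+1}(i) ~ D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D /= D.sum()                                 # normalize by Z_t
        hs.append(h)
        alphas.append(alpha)
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xn: np.sign(sum(a * h(Xn) for a, h in zip(alphas, hs)))
```

The stump learner is only one possible choice; any A that returns hypotheses with weighted error below 1/2 can be plugged in as `weak_learner`.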
Adaboost: A toy example
Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)
[The toy example continues for two more rounds; the accompanying figures are not reproduced here.]
Nice Features of Adaboost
• Very general: a meta-procedure, it can use any weak learning algorithm!!!
• Very fast (single pass through data each round) & simple to code, no parameters to tune.
• Grounded in rich theory.
• Shift in mindset: goal is now just to find classifiers a bit better than random guessing.
• Relevant for big data age: quickly focuses on "core difficulties", well-suited to distributed settings where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].
Analyzing Training Error
Theorem: let ϵ_t = 1/2 - γ_t (error of h_t over D_t). Then
  err_S(H_final) ≤ exp(-2 Σ_t γ_t²)
So, if ∀t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(-2 γ² T).
The training error drops exponentially in T!!!
To get err_S(H_final) ≤ ε, need only T = O((1/γ²) log(1/ε)) rounds.

Adaboost is adaptive:
• Does not need to know γ or T a priori.
• Can exploit γ_t ≫ γ.
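As a quick numerical illustration of this bound (the values of γ, T, and ε below are made up for illustration, not taken from the slides):

```python
import numpy as np

# Training-error bound err_S(H_final) <= exp(-2 * gamma^2 * T) when every
# round has edge gamma_t >= gamma.
gamma = 0.1
for T in (10, 100, 500, 1000):
    print(T, np.exp(-2 * gamma**2 * T))   # bound drops exponentially in T

# Rounds needed so the bound falls below epsilon: T >= log(1/eps) / (2*gamma^2),
# matching the O((1/gamma^2) * log(1/eps)) statement above.
eps = 0.01
print(int(np.ceil(np.log(1 / eps) / (2 * gamma**2))))   # -> 231
```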
Understanding the Updates & Normalization
Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.

Recall D_{t+1}(i) = (D_t(i)/Z_t) · e^{-α_t y_i h_t(x_i)}.

Pr_{D_{t+1}}[y_i = h_t(x_i)] = Σ_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{-α_t}
  = ((1 - ϵ_t)/Z_t) e^{-α_t} = ((1 - ϵ_t)/Z_t) √(ϵ_t/(1 - ϵ_t)) = √((1 - ϵ_t) ϵ_t) / Z_t

Pr_{D_{t+1}}[y_i ≠ h_t(x_i)] = Σ_{i: y_i ≠ h_t(x_i)} (D_t(i)/Z_t) e^{α_t}
  = (ϵ_t/Z_t) e^{α_t} = (ϵ_t/Z_t) √((1 - ϵ_t)/ϵ_t) = √(ϵ_t (1 - ϵ_t)) / Z_t

Probabilities are equal!

Z_t = Σ_i D_t(i) e^{-α_t y_i h_t(x_i)}
  = Σ_{i: y_i = h_t(x_i)} D_t(i) e^{-α_t} + Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t}
  = (1 - ϵ_t) e^{-α_t} + ϵ_t e^{α_t} = 2 √(ϵ_t (1 - ϵ_t))
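A quick numerical check of this claim (a sketch with made-up data, assuming NumPy; not part of the slides):

```python
import numpy as np

# After one AdaBoost reweighting step, the misclassified examples carry
# exactly half of the probability mass under D_{t+1}.
rng = np.random.default_rng(0)
m = 20
y = rng.choice([-1, 1], size=m)
pred = y.copy()
pred[: m // 4] *= -1                      # pretend h_t errs on the first quarter
D = rng.random(m)
D /= D.sum()                              # some current distribution D_t

eps = D[pred != y].sum()                  # eps_t: weighted error of h_t under D_t
alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
D_next = D * np.exp(-alpha * y * pred)    # unnormalized D_{t+1}
D_next /= D_next.sum()                    # divide by Z_t

print(D_next[pred != y].sum())            # -> 0.5 (mass on mistakes of h_t)
print(D_next[pred == y].sum())            # -> 0.5 (mass on correct examples)
```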
Analyzing Training Error: Proof Intuition
Theorem: let ϵ_t = 1/2 - γ_t (error of h_t over D_t). Then err_S(H_final) ≤ exp(-2 Σ_t γ_t²).
• On round t, we increase the weight of the x_i for which h_t is wrong.
• If H_final incorrectly classifies x_i,
  - then x_i is incorrectly classified by the (wtd) majority of the h_t's,
  - which implies the final prob. weight of x_i is large.
    Can show this probability is ≥ (1/m) · (1 / Π_t Z_t).
• Since the sum of the probabilities = 1, we can't have too many examples of high weight.
  Can show # incorrectly classified ≤ m Π_t Z_t. And Π_t Z_t → 0.
Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(-y_i f(x_i)) / Π_t Z_t,
  where f(x_i) = Σ_t α_t h_t(x_i)  [unthresholded weighted vote of the h_t's on x_i].
Step 2: err_S(H_final) ≤ Π_t Z_t.
Step 3: Π_t Z_t = Π_t 2√(ϵ_t (1 - ϵ_t)) = Π_t √(1 - 4γ_t²) ≤ e^{-2 Σ_t γ_t²}.
Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(-y_i f(x_i)) / Π_t Z_t,
  where f(x_i) = Σ_t α_t h_t(x_i).

Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) · exp(-y_i α_t h_t(x_i)) / Z_t.

D_{T+1}(i) = (exp(-y_i α_T h_T(x_i)) / Z_T) × D_T(i)
  = (exp(-y_i α_T h_T(x_i)) / Z_T) × (exp(-y_i α_{T-1} h_{T-1}(x_i)) / Z_{T-1}) × D_{T-1}(i)
  = ……
  = (exp(-y_i α_T h_T(x_i)) / Z_T) × ⋯ × (exp(-y_i α_1 h_1(x_i)) / Z_1) × (1/m)
  = (1/m) · exp(-y_i (α_1 h_1(x_i) + ⋯ + α_T h_T(x_i))) / (Z_1 ⋯ Z_T)
  = (1/m) · exp(-y_i f(x_i)) / Π_t Z_t
Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(-y_i f(x_i)) / Π_t Z_t,
  where f(x_i) = Σ_t α_t h_t(x_i).
Step 2: err_S(H_final) ≤ Π_t Z_t.

err_S(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)]
  = (1/m) Σ_i 1[y_i f(x_i) ≤ 0]
  ≤ (1/m) Σ_i exp(-y_i f(x_i))   [the 0/1 loss is upper bounded by the exp loss]
  = Σ_i D_{T+1}(i) · Π_t Z_t
  = Π_t Z_t.
Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(-y_i f(x_i)) / Π_t Z_t,
  where f(x_i) = Σ_t α_t h_t(x_i).
Step 2: err_S(H_final) ≤ Π_t Z_t.
Step 3: Π_t Z_t = Π_t 2√(ϵ_t (1 - ϵ_t)) = Π_t √(1 - 4γ_t²) ≤ e^{-2 Σ_t γ_t²}.

Note: recall Z_t = (1 - ϵ_t) e^{-α_t} + ϵ_t e^{α_t} = 2√(ϵ_t (1 - ϵ_t));
α_t is the minimizer of α → (1 - ϵ_t) e^{-α} + ϵ_t e^{α}.
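Filling in the two short calculations behind this note (not spelled out on the slide): setting dZ_t/dα = 0 yields the stated α_t and the value of Z_t, and the last inequality in Step 3 follows from 1 + x ≤ e^x:

```latex
Z_t(\alpha) = (1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha}, \qquad
Z_t'(\alpha) = -(1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha} = 0
\;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad Z_t(\alpha_t) = 2\sqrt{\epsilon_t(1-\epsilon_t)}.
```

```latex
\sqrt{1-4\gamma_t^2} = \bigl(1-4\gamma_t^2\bigr)^{1/2}
\le \bigl(e^{-4\gamma_t^2}\bigr)^{1/2} = e^{-2\gamma_t^2},
\qquad \text{using } 1+x \le e^{x} \text{ with } x = -4\gamma_t^2 .
```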
Generalization Guarantees
How about generalization guarantees? Original analysis [Freund & Schapire '97].
Theorem (training error): with ϵ_t = 1/2 - γ_t, err_S(H_final) ≤ exp(-2 Σ_t γ_t²).
• H: space of weak hypotheses; d = VCdim(H); T = # of rounds.
• H_final is a weighted vote, so the hypothesis class is:
  G = {all fns of the form sign(Σ_{t=1}^T α_t h_t(x))}
Theorem [Freund & Schapire '97]: ∀ g ∈ G, err(g) ≤ err_S(g) + Õ(√(T d / m)).
Key reason: VCdim(G) = Õ(d T), plus typical VC bounds.
Generalization Guarantees
Theorem [Freund & Schapire '97]: ∀ g ∈ co(H), err(g) ≤ err_S(g) + Õ(√(T d / m)),
  where d = VCdim(H) and T = # of rounds.
[Figure: error vs. T, showing the train error decreasing with T while the complexity term grows, so the bound on the generalization error first falls and then rises.]
Generalization Guarantees
• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.
• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!
• These results seem to contradict the FS'97 bound and Occam's razor (in order to achieve good test error the classifier should be as simple as possible)!
How can we explain the experiments?
Key Idea:
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".
Training error does not tell the whole story.
We also need to consider the classification confidence!!
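For reference, the notion of confidence used in that explanation is the normalized margin of the weighted vote (the definition below is the standard one from that paper, added here for completeness):

```latex
\operatorname{margin}(x, y) \;=\; \frac{y \sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|} \;\in\; [-1, 1].
```

It is positive exactly when H_final classifies (x, y) correctly, and larger values correspond to a more confident vote; the margin-based explanation is that boosting tends to keep increasing these margins even after the training error has reached zero.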