Post on 13-Jul-2020
(Useful) Information Geometry
Shane Gu and Nilesh Tripuraneni
sg717@cam.ac.uk, nt357@cam.ac.uk
02/04/2015
AdaBoost
- Input: a pool of weak learners (rules) $h \in H$, training data $(x_i, y_i) \in X \times \{-1, 1\}$, and an initial sample weight distribution $D_0$.
- Weak learner: find a weak rule $h_t \in H$ that gives the smallest weighted error $\epsilon_t$ under $D_t$.
- Booster: adjust the sample weights

$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \qquad (1)$$

  where $\alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}$ and $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$.
- Repeat until convergence or satisfaction.
- Output: strong learner (rule) $F(x) = \mathrm{sgn}\left(\sum_t \alpha_t h_t(x)\right)$.
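The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the threshold rules built by `stump`, the one-dimensional inputs, the tie-breaking (first minimum wins) and the choice sgn(0) = +1 are all assumptions made for the example.

```python
import math

def stump(thr, s):
    """Hypothetical weak rule: predict s if x <= thr, else -s."""
    return lambda x: s if x <= thr else -s

def weighted_error(h, xs, ys, D):
    """Total weight of the examples that h misclassifies."""
    return sum(d for x, y, d in zip(xs, ys, D) if h(x) != y)

def adaboost(xs, ys, rounds):
    n = len(xs)
    D = [1.0 / n] * n  # uniform initial sample weights
    pool = [stump(thr, s) for thr in sorted(set(xs)) for s in (1, -1)]
    ensemble = []
    for _ in range(rounds):
        # Weak learner: rule with the smallest weighted error under D_t.
        h = min(pool, key=lambda g: weighted_error(g, xs, ys, D))
        eps = weighted_error(h, xs, ys, D)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Booster: equation (1), with Z_t as the normalizer.
        w = [d * math.exp(-alpha * y * h(x)) for x, y, d in zip(xs, ys, D)]
        Z = sum(w)
        D = [wi / Z for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def strong_rule(ensemble, x):
    """F(x) = sgn(sum_t alpha_t h_t(x)), taking sgn(0) = +1 by convention."""
    F = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if F >= 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold rule fits these labels
ensemble = adaboost(xs, ys, rounds=3)
print([strong_rule(ensemble, x) for x in xs])
```

The toy labels are chosen so that no single rule in the pool classifies every point, which makes the reweighting visible from the second round onward.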
AdaBoost
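- At the end of each round, the weight of an example is proportional to its exponential loss (for a uniform initial distribution):

$$D_{t+1}(i) \propto \prod_{t' \le t} \frac{\exp(-y_i \alpha_{t'} h_{t'}(x_i))}{Z_{t'}} \propto \exp(-y_i F_t(x_i)), \qquad F_t(x) = \sum_{t' \le t} \alpha_{t'} h_{t'}(x)$$

- The total loss is proportional to $Z_t$:

$$\sum_i \exp(-y_i F_t(x_i)) = \sum_i \exp(-y_i (F_{t-1}(x_i) + \alpha_t h_t(x_i))) \propto \sum_i D_t(i) \exp(-y_i \alpha_t h_t(x_i)) \equiv Z_t$$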
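Both statements can be checked numerically. The sketch below assumes uniform initial weights and a hand-picked table `margins` of values $y_i h_t(x_i)$ (hypothetical data, not from the slides). Under those assumptions the average exponential loss equals the product of the normalizers $Z_t$, and the final weights are the normalized per-example losses.

```python
import math

# Hypothetical margins u[t][i] = y_i * h_t(x_i) for three rounds on four examples.
margins = [[1, 1, -1, 1], [1, -1, 1, 1], [-1, 1, 1, 1]]
n = 4
D = [1.0 / n] * n          # uniform initial weights
yF = [0.0] * n             # running value of y_i * F_t(x_i)
prod_Z = 1.0
for u in margins:
    eps = sum(d for d, ui in zip(D, u) if ui == -1)   # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha * ui) for d, ui in zip(D, u)]
    Z = sum(w)
    D = [wi / Z for wi in w]
    yF = [f + alpha * ui for f, ui in zip(yF, u)]
    prod_Z *= Z

unnormalized = [math.exp(-f) for f in yF]   # exp(-y_i F_t(x_i))
total = sum(unnormalized)
mean_exp_loss = total / n
print(mean_exp_loss, prod_Z)                # the two numbers agree
print(D, [v / total for v in unnormalized]) # and so do the two weight vectors
```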
Sequential Error Minimization
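Find $\alpha_t$ and $h_t$ to minimize $Z_t(\alpha_t, h_t)$:

$$Z_t(\alpha_t, h_t) = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i: y_i = h_t(x_i)} D_t(i) \exp(-\alpha_t) + \sum_{i: y_i \neq h_t(x_i)} D_t(i) \exp(\alpha_t)$$

$$= (\exp(\alpha_t) - \exp(-\alpha_t)) \sum_{i=1}^{N} D_t(i) I(y_i \neq h_t(x_i)) + \exp(-\alpha_t) \sum_{i=1}^{N} D_t(i)$$

Choose $h_t$ to minimize the weighted error $\epsilon_t = \sum_i D_t(i) I(y_i \neq h_t(x_i))$. Since $\sum_i D_t(i) = 1$:

$$Z_t(\alpha_t, h_t) = \sum_{i: y_i = h_t(x_i)} D_t(i) \exp(-\alpha_t) + \sum_{i: y_i \neq h_t(x_i)} D_t(i) \exp(\alpha_t) = \exp(-\alpha_t)(1 - \epsilon_t) + \exp(\alpha_t) \epsilon_t$$

Choose $\alpha_t$ such that $\frac{dZ_t}{d\alpha_t} = 0 \implies \alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.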
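For completeness, the stationarity step can be written out as follows. This is a worked derivation from the expression for $Z_t$ above, assuming $0 < \epsilon_t < 1$; the closed form for the minimum value is not stated on the slide but follows by substitution.

```latex
\begin{align*}
\frac{dZ_t}{d\alpha_t}
  &= -e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t = 0
  \;\Longrightarrow\; e^{2\alpha_t} = \frac{1-\epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \\
\frac{d^2Z_t}{d\alpha_t^2}
  &= e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t = Z_t > 0
  \quad\text{(so the stationary point is a minimum)}, \\
Z_t^{\min}
  &= \sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}\,(1-\epsilon_t)
   + \sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}\,\epsilon_t
   = 2\sqrt{\epsilon_t(1-\epsilon_t)}.
\end{align*}
```

In particular $Z_t^{\min} < 1$ whenever $\epsilon_t \neq \frac{1}{2}$, so every round with a non-trivial weak rule shrinks the exponential loss.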
Orthogonality of D
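So $\alpha_t$ is chosen such that $\frac{dZ_t}{d\alpha_t} = 0$, and $h_t$ is chosen to minimize the weighted error $\sum_i D_t(i) I(y_i \neq h_t(x_i))$.
Then the booster constructs a new distribution $D_{t+1}$ such that the correlation with $h_t$ is zero:

$$\sum_i D_{t+1}(i) y_i h_t(x_i) = \frac{1}{Z_t} \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) y_i h_t(x_i) = -\frac{1}{Z_t} \frac{dZ_t}{d\alpha_t} = 0.$$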
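A quick numerical check of the zero-correlation property, assuming a weak rule with values in {-1, +1}. The starting weights `D_t` and the margin vector `u` (with `u[i]` standing for $y_i h_t(x_i)$) are made-up illustration data.

```python
import math

def boost_step(D, u):
    """One booster update; u[i] = y_i * h_t(x_i) in {-1, +1}."""
    eps = sum(d for d, ui in zip(D, u) if ui == -1)   # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha * ui) for d, ui in zip(D, u)]
    Z = sum(w)
    return [wi / Z for wi in w], alpha, Z

D_t = [0.1, 0.2, 0.3, 0.4]
u = [1, -1, 1, 1]                       # only the second example is misclassified
D_next, alpha, Z = boost_step(D_t, u)

correlation = sum(d * ui for d, ui in zip(D_next, u))
new_error = sum(d for d, ui in zip(D_next, u) if ui == -1)
print(correlation, new_error)
```

Zero correlation is the same as saying that $h_t$ has weighted error exactly one half under $D_{t+1}$: the rule just chosen is no better than chance on the reweighted sample.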
Alternative View of AdaBoost
- Weak learner: given $D_t$, find $h_t \in H$ minimizing the weighted error

$$\min_{h_t} \sum_i D_t(i) I(h_t(x_i) \neq y_i)$$

- Booster: given $h_t$, compute $D_{t+1}$ such that

$$\sum_i D_{t+1}(i) y_i h_t(x_i) = 0$$

That is, is the booster pursuing a distribution $D$ such that

$$\sum_i D(i) y_i h_j(x_i) = 0$$

for every $h_j \in H$? This is a set of constraints linear in $D$.
Optimization Problem for AdaBoost?
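Solve:

$$\min_D \; KL(D \| U)$$

such that

$$\sum_i D(i) y_i h_j(x_i) = 0 \;\; \forall j, \qquad D(i) \geq 0 \;\; \forall i, \qquad \sum_i D(i) = 1,$$

where $U$ is the uniform distribution.
Let us assume the feasible set $P$ defined by the constraints is non-empty.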
Iterative Projections
- Initialize $D_1 = U$.
- Choose $h_t \in H$ corresponding to one constraint (weak learner).
- Find $D_{t+1} = \arg\min_{D : \sum_i D(i) y_i h_t(x_i) = 0} KL(D \| D_t)$ (booster).
- Iterate.

Greedy selection of constraints: choose $h_t$ so that $KL(D_{t+1} \| D_t)$ is maximized.

Each round of iterative projection is equivalent to one round of AdaBoost.
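The procedure can be sketched on a toy problem with two constraints. Everything below is an assumption made for illustration: the two margin vectors in `constraints` (each entry stands for $y_i h_j(x_i)$), the cyclic rather than greedy choice of constraint, and the use of the AdaBoost update as the projection step, which is valid for rules taking values in {-1, +1}.

```python
import math

def project(D, u):
    """KL projection of D onto {D' : sum_i D'(i) u[i] = 0}, for u[i] in {-1, +1}.

    For +/-1 margins this is the AdaBoost reweighting step.
    """
    eps = sum(d for d, ui in zip(D, u) if ui == -1)
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha * ui) for d, ui in zip(D, u)]
    Z = sum(w)
    return [wi / Z for wi in w]

# Hypothetical margin vectors u_j[i] = y_i * h_j(x_i) for two weak rules.
# The feasible set is non-empty: e.g. D = (0.2, 0.1, 0.2, 0.3, 0.2) meets both.
constraints = [[1, 1, 1, -1, -1], [1, -1, -1, 1, -1]]

D = [0.2] * 5                      # D_1 = U
for t in range(200):
    D = project(D, constraints[t % 2])   # alternate between the two constraints

residuals = [sum(d * ui for d, ui in zip(D, u)) for u in constraints]
print(D)
print(residuals)                   # both correlations are driven towards zero
```

Each projection satisfies one constraint exactly and disturbs the other slightly; alternating between them shrinks both residuals together.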
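Equivalence Proof
- Booster: using Lagrange multipliers/duality,

$$\max_{\alpha, \mu} \min_D L(\alpha, \mu, D) = KL(D \| D_t) + \alpha \sum_i D(i) y_i h_t(x_i) + \mu \left(\sum_i D(i) - 1\right)$$

$$0 = \frac{\partial L}{\partial D(i)} = \ln \frac{D(i)}{D_t(i)} + 1 + \alpha y_i h_t(x_i) + \mu$$

$$D^*(i) = D_t(i) \exp(-\alpha y_i h_t(x_i) - 1 - \mu) = \frac{1}{Z(\alpha)} D_t(i) \exp(-\alpha y_i h_t(x_i))$$

$$L(\alpha) = -\ln Z(\alpha)$$

  Maximizing $L(\alpha)$ means choosing $\alpha$ to minimize $Z_t$, so $D^*, \alpha, Z \equiv D_{t+1}, \alpha_t, Z_t$ for the same $h_t$.
- Weak learner: find $h_t$ to maximize

$$KL(D_{t+1} \| D_t) = \sum_i D_{t+1}(i) (-\alpha_t y_i h_t(x_i) - \ln Z_t) = -\ln Z_t$$

  This is equivalent to choosing $h_t$ to minimize $Z_t$.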
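The last identity, and the resulting equivalence between the greedy constraint choice and minimizing $Z_t$, can be verified on a small example. The weights `D_t` and the three candidate margin vectors (named `h_a`, `h_b`, `h_c` here) are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def boost_step(D, u):
    """Return (D_{t+1}, Z_t) for margins u[i] = y_i * h_t(x_i) in {-1, +1}."""
    eps = sum(d for d, ui in zip(D, u) if ui == -1)
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha * ui) for d, ui in zip(D, u)]
    Z = sum(w)
    return [wi / Z for wi in w], Z

D_t = [0.1, 0.2, 0.3, 0.4]
candidates = {"h_a": [1, -1, 1, 1], "h_b": [1, 1, -1, 1], "h_c": [-1, 1, 1, 1]}

results = {}
for name, u in candidates.items():
    D_next, Z = boost_step(D_t, u)
    results[name] = (Z, kl(D_next, D_t))
    print(name, Z, kl(D_next, D_t), -math.log(Z))

best_by_Z = min(results, key=lambda k: results[k][0])
best_by_KL = max(results, key=lambda k: results[k][1])
print(best_by_Z, best_by_KL)   # the same rule wins under both criteria
```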
Convergence of AdaBoost
- Recall that $P$ is the feasible set of the constraints, and define $Q$ as the set of distributions $D(i) \propto \exp(-\sum_j \lambda_j y_i h_j(x_i))$. If $d \in P \cap Q$, then by the Pythagorean theorem (as before) $d$ uniquely solves $\min_{p \in P} KL(p \| U)$.
- $D_t$ computed by iterative projection converges to the unique point $d \in P \cap Q$.

By the Pythagorean theorem:

$$KL(D^* \| D_{t+1}) = KL(D^* \| D_t) - KL(D_{t+1} \| D_t)$$

We are always getting closer!

- That is, the loss is $\geq 0$ and non-increasing, so the drop in loss must converge to 0.
- Moreover, if the drop in loss is 0, then $D \in P$.
- The construction of $d$ implies $D^* \in Q$.
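The Pythagorean identity can be checked step by step on a toy problem. The margin vectors in `constraints` and the feasible point `D_star` are assumptions chosen for illustration (any distribution satisfying every constraint would do in place of `D_star`), and the projection step assumes rules with values in {-1, +1}.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def project(D, u):
    """KL projection onto {D' : sum_i D'(i) u[i] = 0} for u[i] in {-1, +1}."""
    eps = sum(d for d, ui in zip(D, u) if ui == -1)
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha * ui) for d, ui in zip(D, u)]
    Z = sum(w)
    return [wi / Z for wi in w]

# Hypothetical margins for two weak rules and a point D_star in the feasible set P.
constraints = [[1, 1, 1, -1, -1], [1, -1, -1, 1, -1]]
D_star = [0.2, 0.1, 0.2, 0.3, 0.2]

D = [0.2] * 5                          # D_1 = U
distances = [kl(D_star, D)]            # KL(D* || D_t) per round
gaps = []                              # violation of the identity per round
for t in range(6):
    D_next = project(D, constraints[t % 2])
    gaps.append(kl(D_star, D_next) - (kl(D_star, D) - kl(D_next, D)))
    distances.append(kl(D_star, D_next))
    D = D_next

print(gaps)        # zero up to floating-point rounding
print(distances)   # non-increasing: each projection moves closer to D*
```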
Duality
Minimizing the exponential loss $E(\exp(-yF(x)))$ is the convex dual of solving the KL-projection problem subject to linear constraints.
Afterthoughts and Bregman Divergences
Why and when does this work?
For a convex function $F$, the induced Bregman divergence is:

$$B_F(p \| q) = F(p) - F(q) - \nabla F(q) \cdot (p - q)$$

Bregman divergences are in 1-to-1 correspondence with exponential families (i.e. contours of equal density define the Bregman distance).
Theorem: For a large family of Bregman divergences, there exists a unique $d^*$ satisfying
- $d^* \in P \cap Q$
- $d^* = \arg\min_{p \in P} B_F(p \| q_0)$
- Pythagorean theorem
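As a small illustration of the definition, the sketch below evaluates $B_F$ for two standard choices of $F$. The function names are made up for this example. With $F(p) = \sum_i p_i \ln p_i$ (negative entropy) and $p$, $q$ both probability vectors, the divergence reduces to $KL(p \| q)$; with $F(p) = \frac{1}{2}\|p\|^2$ it is half the squared Euclidean distance.

```python
import math

def bregman(F, grad_F, p, q):
    """B_F(p || q) = F(p) - F(q) - <grad F(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner

def neg_entropy(p):
    return sum(pi * math.log(pi) for pi in p)

def neg_entropy_grad(p):
    return [math.log(pi) + 1 for pi in p]

def half_sq_norm(p):
    return 0.5 * sum(pi * pi for pi in p)

def half_sq_norm_grad(p):
    return list(p)

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
half_sq_dist = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))

print(bregman(neg_entropy, neg_entropy_grad, p, q), kl_pq)
print(bregman(half_sq_norm, half_sq_norm_grad, p, q), half_sq_dist)
```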
Citations
- R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms.
- M. Collins, R. E. Schapire, and Y. Singer, "Logistic regression, AdaBoost and Bregman distances," Machine Learning, vol. 48, no. 1-3, pp. 253-285, 2002.