A decision-theoretic generalization of on-line learning and an application to boosting


Transcript of "A decision-theoretic generalization of on-line learning and an application to boosting"

Page 1

A decision-theoretic generalization of on-line learning and an application to boosting [1]

From Regret Learning to AdaBoost

Xing Wang

Department of Computer Science, TAMU

Date: May 6, 2015

Page 2

Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for AdaBoost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis

Page 3

External Regret Learning

Initialize w_i^1 \in [0, 1] with \sum_{i=1}^N w_i^1 = 1;
for t = 1 ... T do
    get p^t = w^t / \sum_i w_i^t;
    receive loss vector l^t \in [0, 1]^N; suffer loss p^t \cdot l^t;
    update weight w_i^{t+1} = w_i^t \beta^{l_i^t};
end
Algorithm 1: PW Algorithm
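
A minimal Python sketch of Algorithm 1 (the function name, the fixed value of beta, and the array-based interface are illustrative assumptions, not from the paper):

import numpy as np

def pw_algorithm(losses, beta=0.5):
    """Run Algorithm 1 on a T x N array of losses in [0, 1];
    returns the total loss suffered by the learner."""
    T, N = losses.shape
    w = np.full(N, 1.0 / N)          # initial weights: w_i^1 in [0, 1], summing to 1
    total = 0.0
    for t in range(T):
        p = w / w.sum()              # p^t = w^t / sum_i w_i^t
        total += p @ losses[t]       # suffer loss p^t . l^t
        w = w * beta ** losses[t]    # w_i^{t+1} = w_i^t * beta^{l_i^t}
    return total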

Page 4

From Regret to Adaptive Boosting

input: N labeled samples (x_1, y_1), ..., (x_N, y_N);
       a distribution D over the N samples;
       a weak learning algorithm WeakLearn
Initialize w_i^1 = D(i);
for t = 1 ... T do
    provide WeakLearn with the distribution p^t = w^t / \sum_i w_i^t over the samples, get a hypothesis h_t : X \to [0, 1];
    calculate the error of h_t: \epsilon_t = \sum_{i=1}^N p_i^t |h_t(x_i) - y_i|;
    set \beta_t = \epsilon_t / (1 - \epsilon_t), update the weight vector w_i^{t+1} = w_i^t \beta_t^{1 - |h_t(x_i) - y_i|};
end
Algorithm 2: AdaBoost

h_f(x) = 1 if \sum_{t=1}^T \log(1/\beta_t) h_t(x) \ge \frac{1}{2} \sum_{t=1}^T \log(1/\beta_t), and h_f(x) = 0 otherwise.   (1)
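
A runnable sketch of Algorithm 2 and the final hypothesis (1), assuming binary labels y_i in {0, 1} and using the sklearn decision trees from the later experiments as WeakLearn (function names are mine):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=40, max_depth=4):
    """Algorithm 2: returns the per-round hypotheses h_t and values beta_t."""
    y = np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))              # w_i^1 = D(i), here uniform
    hypotheses, betas = [], []
    for t in range(T):
        p = w / w.sum()                            # distribution p^t
        h = DecisionTreeClassifier(max_depth=max_depth)
        h.fit(X, y, sample_weight=p)               # WeakLearn under p^t
        err = np.abs(h.predict(X) - y)             # |h_t(x_i) - y_i|
        eps = np.sum(p * err)                      # epsilon_t
        if eps == 0 or eps >= 0.5:                 # stop if perfect or no better than random
            break
        beta = eps / (1.0 - eps)                   # beta_t = eps_t / (1 - eps_t)
        w = w * beta ** (1.0 - err)                # weight update
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas

def final_hypothesis(hypotheses, betas, X):
    """Eq. (1): weighted vote of the h_t with coefficients log(1/beta_t)."""
    alphas = np.log(1.0 / np.array(betas))
    votes = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))
    return (votes >= 0.5 * alphas.sum()).astype(int)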

Page 5

Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for AdaBoost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis

Page 6

Error Bound for AdaBoost

Theorem: The error of the final hypothesis h_f, \epsilon = \sum_{i : h_f(x_i) \ne y_i} D(i), is bounded by

    \epsilon \le \prod_{t=1}^T 2\sqrt{\epsilon_t (1 - \epsilon_t)}

Figure 1: 2\sqrt{\epsilon_t (1 - \epsilon_t)}
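
A quick worked instance of the bound (illustrative numbers, assuming a constant per-round error):

% If every round achieves \epsilon_t = 0.3, each factor of the bound is
% 2\sqrt{0.3 \cdot 0.7} \approx 0.917, so
\epsilon \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
         \;=\; \bigl(2\sqrt{0.21}\bigr)^{T} \;\approx\; 0.917^{T},
% which is below 0.04 at T = 40: the training error decays geometrically
% as long as the weak learner stays strictly better than random guessing.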

Page 7

Theorem proof, part 1

From the weight update rule,

    w_i^{T+1} = D(i) \prod_{t=1}^T \beta_t^{1 - |h_t(x_i) - y_i|}   (2)

By (1), h_f(x) = 1 iff \sum_{t=1}^T \log(1/\beta_t) h_t(x) \ge \frac{1}{2} \sum_{t=1}^T \log(1/\beta_t). Then h_f making a mistake (h_f(x_i) \ne y_i) is equivalent to

    \prod_{t=1}^T \beta_t^{-|h_t(x_i) - y_i|} \ge \Bigl(\prod_{t=1}^T \beta_t\Bigr)^{-1/2}   (3)

Plugging (3) into (2) for the mislabeled samples, we have

    \sum_{i=1}^N w_i^{T+1} \ge \sum_{i : h_f(x_i) \ne y_i} w_i^{T+1}
                           \ge \Bigl(\sum_{i : h_f(x_i) \ne y_i} D(i)\Bigr) \Bigl(\prod_{t=1}^T \beta_t\Bigr)^{1/2}
                            = \epsilon \Bigl(\prod_{t=1}^T \beta_t\Bigr)^{1/2}   (4)

Page 8

Theorem proof, part 2

Using \beta^x \le 1 - (1 - \beta)x for x \in [0, 1] (convexity of \beta^x), the total weight evolves as

    \sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \beta_t^{1 - |h_t(x_i) - y_i|}
                         \le \sum_{i=1}^N w_i^t \bigl(1 - (1 - \beta_t)(1 - |h_t(x_i) - y_i|)\bigr)
                          = \Bigl(\sum_{i=1}^N w_i^t\Bigr) \bigl(1 - (1 - \epsilon_t)(1 - \beta_t)\bigr)   (5)

where \epsilon_t = \sum_{i=1}^N w_i^t |h_t(x_i) - y_i| / \sum_{j=1}^N w_j^t. Applying (5) repeatedly for t = 1, ..., T and using \sum_{i=1}^N w_i^1 = \sum_{i=1}^N D(i) = 1,

    \sum_{i=1}^N w_i^{T+1} \le \Bigl(\sum_{i=1}^N w_i^1\Bigr) \prod_{t=1}^T \bigl(1 - (1 - \epsilon_t)(1 - \beta_t)\bigr)
                           \le \prod_{t=1}^T \bigl(1 - (1 - \epsilon_t)(1 - \beta_t)\bigr)   (6)

Page 9

Theorem proof, part 3

Combining (4), \epsilon \bigl(\prod_{t=1}^T \beta_t\bigr)^{1/2} \le \sum_{i=1}^N w_i^{T+1}, with (6), \sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T \bigl(1 - (1 - \epsilon_t)(1 - \beta_t)\bigr), we get:

    \epsilon \le \prod_{t=1}^T \frac{1 - (1 - \epsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}   (7)

The right-hand side attains its minimum at \beta_t = \epsilon_t / (1 - \epsilon_t); plugging this value in finishes the proof: \epsilon \le 2^T \prod_{t=1}^T \sqrt{\epsilon_t (1 - \epsilon_t)}.
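
The minimization step spelled out (a routine calculus check, added for completeness):

% Each factor of (7), viewed as a function of \beta_t with \epsilon_t fixed:
f(\beta_t) = \frac{1 - (1-\epsilon_t)(1-\beta_t)}{\sqrt{\beta_t}}
           = \epsilon_t \beta_t^{-1/2} + (1-\epsilon_t)\,\beta_t^{1/2}
% Setting the derivative to zero:
f'(\beta_t) = -\tfrac{1}{2}\,\epsilon_t\,\beta_t^{-3/2}
              + \tfrac{1}{2}\,(1-\epsilon_t)\,\beta_t^{-1/2} = 0
  \;\Longrightarrow\; \beta_t = \frac{\epsilon_t}{1-\epsilon_t}
% Substituting back yields exactly the factor in the theorem:
f\!\Bigl(\frac{\epsilon_t}{1-\epsilon_t}\Bigr)
  = \epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}
    + (1-\epsilon_t)\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}
  = 2\sqrt{\epsilon_t(1-\epsilon_t)}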

Page 10

Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for AdaBoost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis

Page 11

Experiment settings

Two datasets:

DRIVE [2]: retinal images, blood vessel vs. background

UCI [4]: Japanese credit screening dataset

Decision tree as weak learner:

package from sklearn

max depth of 4

initial sample weight w_i^1 = 0.5 / |{j : l_j = l_i}| (see the sketch below)
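
A small helper implementing this class-balanced initialization (the function name is mine):

import numpy as np

def balanced_initial_weights(labels):
    """w_i^1 = 0.5 / |{j : l_j == l_i}|: each class carries total
    initial weight 0.5, regardless of class imbalance."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    count_of = dict(zip(values.tolist(), counts.tolist()))
    return np.array([0.5 / count_of[l] for l in labels.tolist()])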

Page 12

Retinal blood vessel / background classification


20 training images, a total of 4,541,006 pixels, of which 569,415 are blood vessel pixels.

Two shape features, energy and symmetry, derived from the daisy graph [3].

Page 13

Evaluation results on the Retina Image

Figure 2: \epsilon_t \to 0.5, \beta_t \to 1

There is little update to the sample weights.

\log(1/\beta_t) \to 0: the corresponding classifiers contribute less.

2\sqrt{\epsilon_t (1 - \epsilon_t)} \to 1: no reduction in 2^T \prod_{t=1}^T \sqrt{\epsilon_t (1 - \epsilon_t)}.

Page 14

Credit Screening

UCI Japanese Credit Screening: http://goo.gl/4gBRXb, 532 samples.

Features used: 2, 3, 8, 11, 14, 15 (six continuous features).

Class labels: + (296) / - (357)

Page 15

Evaluation results on the Credit Screening

\epsilon_t of each round is below 0.4.

\epsilon on the training set converges to 0 after 40 rounds.

Page 16

Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for AdaBoost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis

Page 17

PAC framework and VC dimension

Based on [5], with probability 1 - \delta,

    error_{true}(h) \le error_{train}(h) + \sqrt{\frac{\ln H + \ln(1/\delta)}{2m}}   (8)

H is the VC dimension of the hypothesis class.

m is the number of samples.

\delta = 0.05 for the later analysis.
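
A direct transcription of the complexity term in (8) (a sketch; the function and argument names are mine):

import math

def complexity_term(H, m, delta=0.05):
    """The square-root term of Eq. (8): sqrt((ln H + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H) + math.log(1.0 / delta)) / (2.0 * m))

# e.g. with H = 100 and the 532 credit-screening samples:
# complexity_term(100, 532) is roughly 0.085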

Page 18

VC dimension

The VC dimension of a hypothesis class is the largest number of samples such that every assignment of labels to the samples can be separated by some hypothesis in the class.

Example: in one dimension, with the hypothesis class {x > a} (labeled + or -):

there exist two samples that are always separable;

for any three samples, there exists a label assignment that is not separable (checked by the sketch below).
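
A brute-force check of this one-dimensional example (an illustrative helper, not from the slides):

def shatters(points):
    """Check whether the class {x > a} (with either labeling orientation,
    i.e. + or - on each side) realizes every labeling of the given points."""
    thresholds = [min(points) - 1.0] + sorted(points)
    realizable = set()
    for a in thresholds:
        for flip in (False, True):
            realizable.add(tuple((x > a) != flip for x in points))
    return len(realizable) == 2 ** len(points)

print(shatters([0.0, 1.0]))       # True: two points are always separable
print(shatters([0.0, 1.0, 2.0]))  # False: e.g. (+, -, +) is not realizable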

Page 19

VC dimension of decision tree

The VC dimension of a decision tree of depth k on an n-dimensional space is bounded by:

Lower bound: 2^{k-1}(n + 1)

Upper bound [5]: 2(2n)^{2^k - 1}

Page 20

VC dimension of AdaBoost

Let H be the class of hypotheses given by the WeakLearner, with VC dimension d \ge 2. Then the VC dimension of the class of hypotheses given by AdaBoost after T rounds is at most

    2(d + 1)(T + 1) \log_2(e(T + 1))   (9)
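
A sketch plugging (9) into the complexity term of (8) to see how the bound behaves as T grows (the values of d and m in the comment are illustrative assumptions):

import math

def adaboost_vc_dim(d, T):
    """Upper bound (9) on the VC dimension after T rounds of boosting."""
    return 2 * (d + 1) * (T + 1) * math.log2(math.e * (T + 1))

def bound_term(d, T, m, delta=0.05):
    """Complexity term of (8) with H replaced by the bound (9)."""
    H = adaboost_vc_dim(d, T)
    return math.sqrt((math.log(H) + math.log(1.0 / delta)) / (2.0 * m))

# The term grows with T while the training error shrinks, so the PAC
# framework predicts a finite bound-optimal number of rounds, e.g.:
# bound_term(d=24, T=10, m=532) < bound_term(d=24, T=100, m=532)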

Page 21

Mean of Leave-one-out generalization test

Figure 3: Generalization test on Credit Screening

The optimal iteration number given by the PAC framework is less than the optimal number of iterations actually needed.

This is consistent with the paper's results.

Page 22

Thanks, Q&A

Page 23

Reference I

[1] Freund, Yoav, and Robert E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences 55.1 (1997): 119-139.

[2] Staal, J.J., M.D. Abramoff, M. Niemeijer, M.A. Viergever, and B. van Ginneken. "Ridge based vessel segmentation in color images of the retina." IEEE Transactions on Medical Imaging, vol. 23, 2004, pp. 501-509.

[3] Ying, Huajun, Xing Wang, and Jyh-Charn Liu. "Statistical pattern analysis of blood vessel features on retina images and its application to blood vessel mapping algorithms." EMBC 2014.

Page 24

Reference II

[4] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[5] Zettlemoyer, Luke. PAC-learning, VC Dimension. UW, 2012.

Page 25

Iteration statistic

There are cases where boosting iterates fewer than 40 times; the iteration ends because the error rate no longer changes.