
AdaBoost Algorithm Upper Bound for Adaboost Algorithm Experiment Evaluation Generalization Analysis

A decision-theoretic generalization of on-line learning and an application to boosting [1]

From Regret Learning to AdaBoost

Xing Wang

Department of Computer Science, TAMU

Date: May 6, 2015


Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for Adaboost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis


External Regret Learning

Initialize w^1 with ∑_{i=1}^N w^1_i = 1, w^1_i ∈ [0, 1];
for t = 1 ... T do
    get p^t = w^t / ∑_i w^t_i;
    receive loss vector l^t ∈ [0, 1]^N;
    suffer loss p^t · l^t;
    update weight w^{t+1}_i = w^t_i · β^{l^t_i};
end
Algorithm 1: PW Algorithm
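As a minimal sketch (not the authors' code), the PW update loop can be written in a few lines of numpy; the toy loss matrix below is made up for illustration:

```python
import numpy as np

def pw_algorithm(losses, beta=0.5):
    """PW/Hedge: keep weights over N experts and multiplicatively
    penalize expert i by beta**l^t_i after each loss vector l^t."""
    T, N = losses.shape
    w = np.full(N, 1.0 / N)          # initial weights, summing to 1
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()              # distribution p^t
        total_loss += p @ losses[t]  # suffer loss p^t . l^t
        w = w * beta ** losses[t]    # w^{t+1}_i = w^t_i * beta^{l^t_i}
    return total_loss

# toy run: expert 0 always suffers loss 1, expert 1 never does
losses = np.array([[1.0, 0.0]] * 10)
total = pw_algorithm(losses)
```

Because weight shifts quickly onto the lossless expert, the cumulative loss stays bounded by (ln N + L_min ln(1/β)) / (1 − β), here ln 2 / 0.5 ≈ 1.39.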


From Regret to Adaptive Boosting

input: N labeled samples (x_1, y_1), ..., (x_N, y_N);
       distribution D over the N samples;
       weak learning algorithm WeakLearn
Initialize w^1_i = D(i);
for t = 1 ... T do
    provide WeakLearn with the distribution p^t = w^t / ∑_i w^t_i over the samples,
    get a hypothesis h_t : X → [0, 1];
    calculate the error of h_t: ε_t = ∑_{i=1}^N p^t_i |h_t(x_i) − y_i|;
    set β_t = ε_t / (1 − ε_t), and update the weight vector:
    w^{t+1}_i = w^t_i · β_t^{1 − |h_t(x_i) − y_i|};
end
Algorithm 2: AdaBoost

h_f(x) = 1 if ∑_{t=1}^T log(1/β_t) h_t(x) ≥ (1/2) ∑_{t=1}^T log(1/β_t), and 0 otherwise    (1)
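A minimal numpy sketch of Algorithm 2 (not the authors' code; the one-feature threshold weak learner and the toy data are made up for illustration). It also records the per-round errors ε_t, so the training-error bound of the next section can be checked empirically:

```python
import numpy as np

def weak_learn(X, y, p):
    """Hypothetical weak learner: best 1-D threshold stump
    (either polarity) under distribution p."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) >= 0).astype(float)
                err = float(np.sum(p * np.abs(pred - y)))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda Z, j=j, thr=thr, sign=sign: (sign * (Z[:, j] - thr) >= 0).astype(float)
    return h, err

def adaboost(X, y, T=10):
    """Algorithm 2 with a uniform initial distribution D."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # w^1_i = D(i)
    hyps, betas, eps_list = [], [], []
    for _ in range(T):
        p = w / w.sum()                            # distribution p^t
        h, eps = weak_learn(X, y, p)
        if eps <= 0 or eps >= 0.5:                 # weak-learning assumption violated
            break
        beta = eps / (1.0 - eps)                   # beta_t = eps_t / (1 - eps_t)
        w = w * beta ** (1.0 - np.abs(h(X) - y))   # correct samples shrink in weight
        hyps.append(h); betas.append(beta); eps_list.append(eps)
    logs = np.log(1.0 / np.array(betas))
    def h_f(Z):                                    # weighted-majority vote, Eq. (1)
        votes = sum(l * h(Z) for l, h in zip(logs, hyps))
        return (votes >= 0.5 * logs.sum()).astype(int)
    return h_f, eps_list

# toy 1-D data that no single stump classifies perfectly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
h_f, eps_list = adaboost(X, y, T=20)
train_err = float(np.mean(h_f(X) != y))
bound = float(np.prod([2 * np.sqrt(e * (1 - e)) for e in eps_list]))
```

By the theorem of the next section, `train_err` can never exceed `bound`, whatever data is used.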


Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for Adaboost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis


Error Bound for AdaBoost

Theorem. The error of the final hypothesis h_f, ε = ∑_{i : h_f(x_i) ≠ y_i} D(i), is bounded by

ε ≤ ∏_{t=1}^T 2√(ε_t(1 − ε_t))

Figure 1: 2√(ε_t(1 − ε_t))


Theorem proof, part 1

w^{T+1}_i = D(i) ∏_{t=1}^T β_t^{1 − |h_t(x_i) − y_i|}    (2)

h_f(x) = 1 iff ∑_{t=1}^T log(1/β_t) h_t(x) ≥ (1/2) ∑_{t=1}^T log(1/β_t). Hence h_f making a mistake (h_f(x_i) ≠ y_i) is equivalent to

∏_{t=1}^T β_t^{−|h_t(x_i) − y_i|} ≥ (∏_{t=1}^T β_t)^{−1/2}    (3)

Plugging (3) into (2) for the mislabeled samples, we have

∑_{i=1}^N w^{T+1}_i ≥ ∑_{i : h_f(x_i) ≠ y_i} w^{T+1}_i ≥ (∑_{i : h_f(x_i) ≠ y_i} D(i)) (∏_{t=1}^T β_t)^{1/2} = ε (∏_{t=1}^T β_t)^{1/2}    (4)


Theorem proof, part 2

∑_{i=1}^N w^{t+1}_i = ∑_{i=1}^N w^t_i β_t^{1 − |h_t(x_i) − y_i|}
                    ≤ ∑_{i=1}^N w^t_i (1 − (1 − β_t)(1 − |h_t(x_i) − y_i|))
                    = (∑_{i=1}^N w^t_i)(1 − (1 − ε_t)(1 − β_t))    (5)

where ε_t = ∑_{i=1}^N w^t_i |h_t(x_i) − y_i| / ∑_{j=1}^N w^t_j. The inequality uses β^x ≤ 1 − (1 − β)x for β ≥ 0 and x ∈ [0, 1]. Applying (5) repeatedly for t = 1, ..., T:

∑_{i=1}^N w^{T+1}_i ≤ (∑_{i=1}^N w^1_i) ∏_{t=1}^T (1 − (1 − ε_t)(1 − β_t)) ≤ ∏_{t=1}^T (1 − (1 − ε_t)(1 − β_t))    (6)


Theorem proof, part 3

Combining (4), ε (∏_{t=1}^T β_t)^{1/2} ≤ ∑_{i=1}^N w^{T+1}_i, with (6), ∑_{i=1}^N w^{T+1}_i ≤ ∏_{t=1}^T (1 − (1 − ε_t)(1 − β_t)), we get:

ε ≤ ∏_{t=1}^T (1 − (1 − ε_t)(1 − β_t)) / √β_t    (7)

The right-hand side attains its minimum at β_t = ε_t / (1 − ε_t); plugging in this value finishes the proof: ε ≤ 2^T ∏_{t=1}^T √(ε_t(1 − ε_t)).
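The minimizing β_t can be checked numerically with a small grid search (a sketch, with ε fixed at an arbitrary 0.3): the minimizer of each factor of (7) sits at ε/(1 − ε) and the minimum value is 2√(ε(1 − ε)).

```python
import numpy as np

eps = 0.3  # arbitrary weak-learner error for the check
f = lambda b: (1 - (1 - eps) * (1 - b)) / np.sqrt(b)   # one factor of (7)

# brute-force minimization over a fine grid of beta values
grid = np.linspace(0.001, 0.999, 99901)
b_star = grid[np.argmin(f(grid))]
```

Here `b_star` lands at ε/(1 − ε) = 3/7, and f(3/7) matches 2√(0.3 · 0.7) to machine precision.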


Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for Adaboost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis


Experiment settings

Two datasets:

- DRIVE [2] retinal images, blood vessel vs. background
- UCI [4] Japanese credit screening dataset

Decision tree as the weak learner:

- package from sklearn
- max depth of 4
- initial sample weight w^1_i = 0.5 / |{j : l_j = l_i}|
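The class-balanced initialization can be sketched in a few lines; the label vector below is a made-up stand-in for the real data (the experiments themselves use sklearn's DecisionTreeClassifier with max_depth=4 as the weak learner):

```python
import numpy as np

# toy label vector standing in for the real dataset's labels
labels = np.array([0, 0, 0, 1, 1])

# class-balanced initial weights: w^1_i = 0.5 / |{j : l_j = l_i}|
counts = np.bincount(labels)
w1 = 0.5 / counts[labels]
```

Each class then carries total weight 0.5 regardless of its size, so the weights sum to 1 and a majority class cannot dominate the first round.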


Retinal blood vessel / background classification

(a) (b)

20 training images, a total of 4,541,006 pixels, of which 569,415 are blood vessel pixels.

Two shape features, energy and symmetry, derived from the daisy graph [3].


Evaluation results on the retina images

Figure 2: ε_t → 0.5, β_t → 1

- There is little update on the sample weights.
- log(1/β_t) → 0, so the corresponding classifiers contribute less.
- 2√(ε_t(1 − ε_t)) → 1, so there is no further reduction of 2^T ∏_{t=1}^T √(ε_t(1 − ε_t)).


Credit Screening

UCI Japanese Credit Screening: http://goo.gl/4gBRXb, 532 samples.

Features used: 2, 3, 8, 11, 14, 15 (six continuous features).

Class label: +(296) / −(357)


Evaluation results on the credit screening dataset

- ε_t of each round is below 0.4.
- ε on the training set converges to 0 after 40 rounds.


Table of Contents

1 AdaBoost Algorithm

2 Upper Bound for Adaboost Algorithm

3 Experiment Evaluation
    Experiment 1
    Experiment 2

4 Generalization Analysis


PAC framework and VC dimension

Based on [5], with probability 1 − δ,

error_true(h) ≤ error_train(h) + √((ln H + ln(1/δ)) / (2m))    (8)

- H is the VC dimension of the hypothesis class
- m is the number of samples
- δ = 0.05 for the later analysis
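The complexity term of (8) is easy to evaluate directly; the values below are illustrative (a hypothetical complexity H = 1000, with m = 532 matching the credit dataset):

```python
import math

def generalization_gap(H, m, delta=0.05):
    """Second term of (8): sqrt((ln H + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H) + math.log(1.0 / delta)) / (2.0 * m))

# illustrative: hypothetical complexity H = 1000, m = 532 samples
gap = generalization_gap(1000, 532)
```

The gap shrinks like 1/√m, so quadrupling the sample count halves the bound.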


VC dimension

The VC dimension of a hypothesis class is the largest number of samples such that every assignment of labels to the samples can be separated by some hypothesis in the class.
Example: in one dimension, with the threshold hypothesis class {x > a} (either polarity):

- there exist two samples for which every labeling is separable;
- for any three samples, there exists a label assignment that is not separable.
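The threshold example can be verified by brute force (a small sketch; the helper names are mine): enumerate every labeling a threshold hypothesis can realize and compare against all 2^n labelings.

```python
def threshold_labelings(points):
    """All labelings of `points` achievable by a hypothesis x > a,
    with either polarity."""
    labelings = set()
    # one threshold below all points, plus one at each point,
    # covers every distinct cut position
    cuts = [min(points) - 1] + sorted(points)
    for a in cuts:
        h = tuple(int(x > a) for x in points)
        labelings.add(h)
        labelings.add(tuple(1 - b for b in h))  # flipped polarity
    return labelings

def shattered(points):
    return len(threshold_labelings(points)) == 2 ** len(points)

print(shattered([0.0, 1.0]))        # True: two points are shattered
print(shattered([0.0, 1.0, 2.0]))   # False: e.g. (0, 1, 0) is unreachable
```

So the VC dimension of the threshold class is exactly 2, matching the example above.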


VC dimension of decision tree

The VC dimension of a decision tree of depth k on an n-dimensional space is bounded by:

- Lower bound: 2^{k−1}(n + 1)
- Upper bound [5]: 2(2n)^{2^k − 1}


VC dimension of the Adaboost

Let H be the class of hypotheses produced by WeakLearn, with VC dimension d ≥ 2. Then the VC dimension of the hypothesis class produced by AdaBoost after T rounds is at most

2(d + 1)(T + 1) log2(e(T + 1))    (9)
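Bound (9) is straightforward to evaluate; the d and T values below are illustrative, not taken from the experiments:

```python
import math

def boosted_vc_upper_bound(d, T):
    """Bound (9): 2 (d + 1) (T + 1) log2(e (T + 1))."""
    return 2 * (d + 1) * (T + 1) * math.log2(math.e * (T + 1))

# e.g. a weak-learner class with VC dimension d = 5, boosted for T = 40 rounds
bound = boosted_vc_upper_bound(5, 40)
```

The bound grows roughly linearly in T (up to a log factor), which is why running many rounds eventually loosens the generalization guarantee of (8).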


Mean of Leave-one-out generalization test

Figure 3: Generalization test on Credit Screening

- The optimal iteration number given by the PAC framework is smaller than the number of iterations actually needed.
- This is consistent with the paper's results.


Thanks, Q&A


Reference I

[1] Freund, Yoav, and Robert E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences 55.1 (1997): 119-139.

[2] J.J. Staal, M.D. Abramoff, M. Niemeijer, M.A. Viergever, B. van Ginneken. "Ridge-based vessel segmentation in color images of the retina." IEEE Transactions on Medical Imaging, 2004, vol. 23, pp. 501-509.

[3] Huajun, Ying, Wang Xing, and Liu Jyh-Charn. "Statistical pattern analysis of blood vessel features on retina images and its application to blood vessel mapping algorithms." EMBC 2014.


Reference II

[4] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[5] Luke Zettlemoyer. PAC-learning, VC Dimension. UW, 2012.


Iteration statistic

There are cases where boosting iterates fewer than 40 times; the iteration stops early because the error rate no longer changes.