Machine Learning


Transcript of Machine Learning

Page 1: Machine Learning

1

Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen
Computer Science and Engineering Dept.

Page 2: Machine Learning

2

Boosting

Method for converting rules of thumb into a prediction rule.

Rule of thumb? Method?

Page 3: Machine Learning

3

Binary Classification

X: set of all possible instances or examples.
- e.g., possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

c: X → {0,1}: the target concept to learn.
- e.g., c: EnjoySport → {0,1}

H: set of concept hypotheses.
- e.g., conjunctions of literals: <?, Cold, High, ?, ?, ?>

C: concept class, a set of target concepts c.

D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.

Page 4: Machine Learning

4

Binary Classification

S: training sample <x1, c(x1)>, …, <xm, c(xm)>

The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S

Page 5: Machine Learning

5

Errors

True error or generalization error of h with respect to the target concept c and distribution D:

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

Empirical error: average error of h on the training sample S drawn according to distribution D:

R̂[h] = (1/m) Σ_{i=1}^{m} 1[h(x_i) ≠ c(x_i)] = Pr_{x∼S}[h(x) ≠ c(x)]

Page 6: Machine Learning

6

Errors

Questions:
- Can we bound the true error of a hypothesis given only its training error?
- How many examples are needed for a good approximation?

Page 7: Machine Learning

7

Approximate Concept Learning

Requiring a learner to acquire the right concept is too strict

Instead, we will allow the learner to produce a good approximation to the actual concept

Page 8: Machine Learning

8

General Assumptions

Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)

Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).

Goal: h should have a low error rate on new examples drawn from the same distribution D.

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

Page 9: Machine Learning

9

PAC learning model

PAC learning: Probably Approximately Correct learning

The goal of the learning algorithm: to do optimization over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want the error of h, R[h], to be small.

If R[h] is small, h is "probably approximately correct". Formally, h is PAC if

Pr[R[h] ≤ ε] ≥ 1 − δ for all c ∈ C, ε > 0, δ > 0, and all distributions D

Page 10: Machine Learning

10

PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H, for all c ∈ C, ε > 0, δ > 0, and all distributions D, such that:

Pr[R[h] ≤ ε] ≥ 1 − δ, while using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time. ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner.

Page 11: Machine Learning

11

PAC learning model

Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

Pr[R[h] ≤ 1/2 − γ] ≥ 1 − δ for all c ∈ C, γ > 0, δ > 0, and all distributions D

A weak learner only outputs a hypothesis that performs slightly better than random guessing.

Hypothesis boosting problem: can we "boost" a weak learner into a strong learner?

Rule of thumb ~ weak learner; Method ~ boosting

Page 12: Machine Learning

12

Boosting a weak learner – Majority Vote

L learns h1 on the first N training points.

L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2.

L builds a third training set of N points for which h1 and h2 disagree, and produces h3.

L outputs h = Majority Vote(h1, h2, h3)
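A minimal sketch of this three-classifier scheme in Python, assuming a generic `weak_learn(X, y)` routine that returns an object with a `.predict` method and labels in {-1, +1}; the filtering step is simplified to boolean masks and is illustrative rather than Schapire's exact procedure:

```python
import numpy as np

def boost_by_majority(X, y, weak_learn, n_first):
    """Sketch of the three-hypothesis majority-vote boosting scheme."""
    # h1: train on the first N points
    h1 = weak_learn(X[:n_first], y[:n_first])

    # h2: train on a filtered set where h1 is right half the time and wrong half the time
    rest_X, rest_y = X[n_first:], y[n_first:]
    correct = h1.predict(rest_X) == rest_y
    half = min(correct.sum(), (~correct).sum())
    idx = np.concatenate([np.where(correct)[0][:half], np.where(~correct)[0][:half]])
    h2 = weak_learn(rest_X[idx], rest_y[idx])

    # h3: train on points where h1 and h2 disagree
    disagree = h1.predict(X) != h2.predict(X)
    h3 = weak_learn(X[disagree], y[disagree])

    def predict(X_new):
        votes = np.stack([h.predict(X_new) for h in (h1, h2, h3)])
        return np.sign(votes.sum(axis=0))   # majority vote of the three hypotheses

    return predict
```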

Page 13: Machine Learning

13

Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. A formal description of boosting:

- Given training set (x_1, y_1), …, (x_m, y_m)
- y_i ∈ {−1, +1}: correct label of x_i ∈ X
- for t = 1, …, T:
  - construct distribution D_t on {1, …, m}
  - find weak hypothesis h_t: X → {−1, +1} with small error ε_t on D_t
- output final hypothesis H_final

Page 14: Machine Learning

14

Boosting

[Diagram: the training sample is reweighted after each round; each weighted sample yields a weak hypothesis h_1(x), h_2(x), …, h_T(x); the final hypothesis is H(x) = sign(Σ_t α_t h_t(x))]

Page 15: Machine Learning

15

Boosting algorithms

- AdaBoost (Adaptive Boosting)
- LPBoost (Linear Programming Boosting)
- BrownBoost
- MadaBoost (modifies the weighting system of AdaBoost)
- LogitBoost

Page 16: Machine Learning

16

Lecture

- Motivating example
- AdaBoost
- Training error
- Overfitting
- Generalization error
- Examples of AdaBoost
- Multiclass for weak learners

Page 17: Machine Learning

17

Thank you!

Page 18: Machine Learning

18

Machine Learning

Proof of Bound on AdaBoost Training Error
Aaron Palmer

Page 19: Machine Learning

19

Theorem: 2 Class Error Bounds

Assume γ_t = 1/2 − ε_t

ε_t = error rate on round t of boosting; γ_t = how much better than random guessing

γ_t: a small, positive number

Training error of H_final is bounded by:

training error(H_final) ≤ ∏_t 2√(ε_t(1 − ε_t)) = ∏_t √(1 − 4γ_t²) ≤ exp(−2 Σ_t γ_t²)

Page 20: Machine Learning

20

Implications?

T = number of rounds of boosting; γ_t and T do not need to be known in advance

As long as γ_t ≥ γ > 0 on every round, the training error will decrease exponentially as a function of T

Page 21: Machine Learning

21

Proof Part I: Unwrap Distribution

Let f(x) = Σ_t α_t h_t(x), so that H_final(x) = sign(f(x)). Unwrapping the distribution update:

D_{T+1}(i) = D_T(i) exp(−α_T y_i h_T(x_i)) / Z_T
= D_{T−1}(i) exp(−α_{T−1} y_i h_{T−1}(x_i)) exp(−α_T y_i h_T(x_i)) / (Z_{T−1} Z_T)
= (1/m) exp(−y_i Σ_t α_t h_t(x_i)) / ∏_t Z_t

= (1/m) exp(−y_i f(x_i)) / ∏_t Z_t

Page 22: Machine Learning

22

Proof Part II: training error

Training error(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)] = (1/m) Σ_i 1[y_i f(x_i) ≤ 0]

≤ (1/m) Σ_i exp(−y_i f(x_i)) = Σ_i D_{T+1}(i) ∏_t Z_t = ∏_t Z_t

Page 23: Machine Learning

23

Proof Part III:

To make the bound ∏_t Z_t as small as possible, minimize each Z_t over α_t:

Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)) = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}

Set dZ_t/dα_t equal to zero and solve for α_t:  α_t = (1/2) ln((1 − ε_t)/ε_t)

Plug back into Z_t:  Z_t = 2√(ε_t(1 − ε_t))

Page 24: Machine Learning

24

Part III: Continued

Plug in the definition ε_t = 1/2 − γ_t:

Z_t = 2√(ε_t(1 − ε_t)) = 2√((1/2 − γ_t)(1/2 + γ_t)) = √(1 − 4γ_t²)

Page 25: Machine Learning

25

Exponential Bound

Use the property 1 + x ≤ e^x for all x.

Take x to be −4γ_t²:

∏_t √(1 − 4γ_t²) ≤ ∏_t (e^{−4γ_t²})^{1/2} = exp(−2 Σ_t γ_t²)

Page 26: Machine Learning

26

Putting it together

We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined

Bound is pretty loose

Page 27: Machine Learning

27

Example:

Suppose that all γ_t are at least 10%, so that no h_t has an error rate above 40%.

What upper bound does the theorem place on the training error?

Answer: training error(H_final) ≤ ∏_t 2√(ε_t(1 − ε_t)) ≤ (2√(0.4 × 0.6))^T ≈ (0.98)^T ≤ e^{−0.02T}
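A quick numerical check of that answer (plain arithmetic, with γ_t fixed at 0.1 for every round):

```python
import math

# Per-round factor when eps_t <= 0.4: 2*sqrt(eps_t*(1-eps_t)) <= 2*sqrt(0.24)
per_round = 2 * math.sqrt(0.4 * 0.6)   # ~0.9798
loose = math.exp(-2 * 0.1 ** 2)        # e^{-0.02} ~0.9802, the exponential form of the bound

for T in (10, 50, 100, 500):
    print(T, per_round ** T, loose ** T)   # both bounds shrink exponentially in T
```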

Page 28: Machine Learning

28

Overfitting?

Does the proof say anything about overfitting?

While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?

Page 29: Machine Learning

Boosting
Ayman Alharbi

Page 30: Machine Learning

Example (Spam emails)

Problem: filter out spam (junk email)

- Gather large collection of examples of spam and non-spam

From: [email protected] “can you review a paper” ... non-spam

From: [email protected] “Win 10000$ easily !!” ... spam

Page 31: Machine Learning

Example (Spam emails)

Main Observation - Easy to find “rules of thumb” that are “often” correct

- Hard to find single rule that is very highly accurate

If ‘buy now’ occurs in message, then predict ‘spam’

Page 32: Machine Learning

Example (Phone Cards)

Goal: automatically categorize type of call requested by phone customer

(Collect, CallingCard, PersonToPerson, etc.)- Yes I’d like to place a collect call long distance please

(Collect)

- operator I need to make a call but I need to bill it to my office

(ThirdNumber)

- I just called the wrong number and I would like to have that taken off of my bill

(BillingCredit)

Main Observation: Easy to find "rules of thumb" that are "often" correct

If ‘bill’ occurs in utterance, then predict ‘BillingCredit’

Hard to find single rule that is very highly accurate

Page 33: Machine Learning

The Boosting Approach
- Devise a computer program for deriving rough rules of thumb
- Apply procedure to a subset of emails
- Obtain rule of thumb
- Apply to 2nd subset of emails
- Obtain 2nd rule of thumb
- Repeat T times

Page 34: Machine Learning

Details: How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

How to combine rules of thumb into single prediction rule?

- Take (weighted) majority vote of rules of thumb !!

Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting.

Page 35: Machine Learning

Idea

At each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a hypothesis h_t
- Choose a strength for this hypothesis, α_t

Final classifier: weighted combination of weak learners

Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let the learned classifiers vote

Page 36: Machine Learning

36

Boosting

AdaBoost Algorithm
Presenter: Karl Severin

Computer Science and Engineering Dept.

Page 37: Machine Learning

37

Boosting Overview

Goal: Form one strong classifier from multiple weak classifiers.

Proceeds in rounds, iteratively producing classifiers from the weak learner.

Increases the weight given to incorrectly classified examples.

Gives each classifier an importance that is inversely proportional to its weighted error.

Each classifier gets a vote based on its importance.

Page 38: Machine Learning

38

Initialize

Initialize with an evenly weighted distribution: D_1(i) = 1/m for i = 1, …, m

Begin generating classifiers

Page 39: Machine Learning

39

Error

Quality of a classifier is based on its weighted error:

ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i] = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)

This is the probability that h_t will misclassify an example selected according to distribution D_t, or equivalently the summation of the weights of all misclassified examples.

Page 40: Machine Learning

40

Classifier Importance

α_t measures the importance given to classifier h_t:

α_t = (1/2) ln((1 − ε_t)/ε_t)

α_t > 0 if ε_t < 1/2 (ε_t is assumed to always be < 1/2)

α_t is inversely proportional to ε_t: the smaller the weighted error, the larger the importance
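A tiny numeric illustration of how α_t grows as the weighted error shrinks (the error rates below are arbitrary sample values):

```python
import math

for eps in (0.45, 0.30, 0.10, 0.01):
    alpha = 0.5 * math.log((1 - eps) / eps)   # classifier importance
    print(f"eps_t = {eps:.2f}  ->  alpha_t = {alpha:.3f}")
# eps_t near 1/2 gives alpha_t near 0; small eps_t gives a large vote
```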

Page 41: Machine Learning

41

Update Distribution

Update: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor.

This increases the weight of misclassified examples and decreases the weight of correctly classified examples.

Page 42: Machine Learning

42

Combine Classifiers

When classifying a new instance x, all of the weak classifiers get a vote weighted by their α:

H(x) = sign(Σ_t α_t h_t(x))
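Putting the steps together, here is a from-scratch sketch of binary AdaBoost with decision stumps as the weak learner; labels are assumed to be in {-1, +1}, and the stump search and function names are illustrative rather than taken from the slides:

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: exhaustive search for the best decision stump under weights D."""
    m, n = X.shape
    best = None
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, T=50):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)             # 1. start with a uniform distribution
    stumps, alphas = [], []
    for _ in range(T):
        stump = train_stump(X, y, D)    # 2. fit a weak classifier to the weighted sample
        eps = max(stump[0], 1e-10)      #    weighted error eps_t (guard against 0)
        alpha = 0.5 * np.log((1 - eps) / eps)   # 3. importance alpha_t
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)  # 4. up-weight mistakes, down-weight correct points
        D /= D.sum()                    #    renormalize (divide by Z_t)
        stumps.append(stump)
        alphas.append(alpha)

    def H(X_new):                       # 5. final hypothesis: sign of the weighted vote
        X_new = np.asarray(X_new, dtype=float)
        agg = sum(a * stump_predict(s, X_new) for a, s in zip(alphas, stumps))
        return np.sign(agg)

    return H
```

A trained ensemble is then used as `H = adaboost(X, y, T=50)` followed by `y_pred = H(X_test)`.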

Page 43: Machine Learning

43

Review

Page 44: Machine Learning

44

Questions?

?

Page 45: Machine Learning

45

Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

Page 46: Machine Learning

46

Generalization Error

Generalization error is the true error of a classifier

Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error

For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound

Page 47: Machine Learning

47

Generalization Error First Bound

With high probability, the generalization error is bounded as

R(H_final) ≤ R̂(H_final) + Õ(√(Td/m))

where R̂ = empirical risk (training error), T = boosting rounds, d = VC dimension of the base classifiers, m = number of training examples, R = generalization error.

Page 48: Machine Learning

48

Intuition of Bound: Hoeffding’s inequality

Define H to be a finite set of hypotheses which map examples to 0 or 1.

Hoeffding's inequality: Let X_1, …, X_m be independent random variables such that X_i ∈ [0, 1]. Denote their average value by X̄ = (1/m) Σ_i X_i. Then for any ε > 0 we have:

Pr[X̄ − E[X̄] ≥ ε] ≤ exp(−2mε²)

In the context of machine learning, think of the X_i as errors given by a hypothesis h: X_i = 1[h(x_i) ≠ y_i], where y_i is the true label for x_i.

Page 49: Machine Learning

49

Intuition of Bound: Hoeffding’s inequality

So X̄ is the empirical error R̂(h) and E[X̄] is the generalization error R(h),

and by Hoeffding's inequality:

Pr[R(h) ≥ R̂(h) + ε] ≤ exp(−2mε²)

Page 50: Machine Learning

50

Intuition of Bound

If we want to bound the generalization error of a single hypothesis with probability 1 − δ, where δ = exp(−2mε²), then we can solve for ε using δ: ε = √(ln(1/δ)/(2m)).

So R(h) ≤ R̂(h) + √(ln(1/δ)/(2m)) will hold with probability 1 − δ.

Page 51: Machine Learning

51

Intuition of Bound: Bounding All Hypotheses in the Set

So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis.

How do we bound the difference for all hypotheses in H?

Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0:

Pr[∃ h ∈ H: R(h) ≥ R̂(h) + ε] ≤ |H| exp(−2mε²)

Again by setting δ = |H| exp(−2mε²) and solving for ε we have

R(h) ≤ R̂(h) + √((ln|H| + ln(1/δ))/(2m))

which will hold with probability 1 − δ for all h ∈ H.
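As a quick sanity check on this finite-|H| bound, one can solve it for the deviation ε or for the required sample size m; the numbers below are arbitrary illustrations:

```python
import math

def deviation(H_size, m, delta):
    """eps such that R(h) <= R_hat(h) + eps for all h in H, with probability 1 - delta."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

def samples_needed(H_size, eps, delta):
    """Smallest m giving deviation at most eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

print(deviation(H_size=10**6, m=10_000, delta=0.05))      # ~0.029
print(samples_needed(H_size=10**6, eps=0.01, delta=0.05)) # ~84,056
```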

Page 52: Machine Learning

52

Intuition of Bound: Bounding Hypotheses in an Infinite Set

What about cases when H is infinite? Even if H is infinite, given a set of m examples the hypotheses in H may only be capable of labeling the examples in a number of ways < 2^m.

This implies that though H is infinite, the hypotheses are divided into classes which produce the same labelings. So the effective number of hypotheses is equal to the number of classes.

By using arguments similar to those above, |H| in the Hoeffding-based bound can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants.

Page 53: Machine Learning

53

Intuition of Bound: Bounding Hypotheses in an Infinite Set

More formally, let S = {x_1, …, x_m} be a finite set of examples and define the set of dichotomies Π_H(S) to be all possible labelings of S by hypotheses of H.

Also define the growth function to be the function

Π_H(m) = max_{|S| = m} |Π_H(S)|

which measures the maximum number of dichotomies for any sample of size m.

Page 54: Machine Learning

54

Intuition of Bound: Bounding Hypotheses in an Infinite Set

Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0, the probability that some h ∈ H has R(h) ≥ R̂(h) + ε can be bounded in terms of the growth function Π_H(m),

and with probability at least 1 − δ,

for all h ∈ H:  R(h) ≤ R̂(h) + O(√((ln Π_H(m) + ln(1/δ))/m))

Page 55: Machine Learning

55

Intuition of Bound: Bounding Hypotheses in an Infinite Set

It turns out that the growth function is either polynomial in m or equal to 2^m.

In the cases where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H.

In the case when the growth function is 2^m, the VC dimension is infinite.

VC dimension: the maximum number of points which can be shattered by H. Points are said to be shattered if the hypotheses in H can realize all possible labelings of the points.

Page 56: Machine Learning

56

Intuition of Bound: Bounding Hypotheses in an Infinite Set

The VC dimension turns out to be a very natural measure for the complexity of H and can be used to bound the growth function given m examples.

Sauer's lemma: If H is a hypothesis class of VC dimension d, then for all m:

Π_H(m) ≤ Σ_{i=0}^{d} C(m, i), and when m ≥ d, Π_H(m) ≤ (em/d)^d
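A small numerical illustration of Sauer's lemma, comparing the dichotomy count Σ_{i≤d} C(m, i) with the simpler (em/d)^d bound and with the unrestricted 2^m (m and d are arbitrary choices):

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 5
print(sauer_bound(m, d))        # ~7.9e7, polynomial in m
print((math.e * m / d) ** d)    # (em/d)^d ~4.8e8, the looser closed-form bound
print(2 ** m)                   # 2^m ~1.3e30, the unrestricted number of labelings
```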

Page 57: Machine Learning

57

Intuition of Bound: Bounding Hypotheses in an Infinite Set

Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then for any ε > 0, the probability that some h ∈ H has R(h) ≥ R̂(h) + ε can be bounded by combining the previous theorem with Sauer's lemma.

Thus with probability at least 1 − δ, for all h ∈ H:

R(h) ≤ R̂(h) + O(√((d ln(m/d) + ln(1/δ))/m))

Page 58: Machine Learning

58

Intuition of Bound: AdaBoost Generalization Error

The first bound for the AdaBoost generalization error follows from Sauer's lemma in a similar way

The growth function for Adaboost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations

Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error

Page 59: Machine Learning

59

Generalization Error First Bound

Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses. The log factors and constants are ignored.

This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.

Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.

Page 60: Machine Learning

60

Example

The graph shows boosting used on C4.5 to identify images of hand written characters. This experiment was carried out by Robert Schapire et al.

Even as the training error goes to zero and AdaBoost has been run for 1000 rounds the test error continues to decrease

[Graph: training and test error vs. number of boosting rounds, with the C4.5 test error shown for comparison]

Page 61: Machine Learning

61

Margin

AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made

This notion of confidence is quantified by the margin:

margin(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t

The margin takes values between −1 and 1

The magnitude of the margin can be viewed as a measure of confidence
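A sketch of how the normalized margin would be computed from the weak classifiers' votes; the α values and vote matrix below are made-up placeholders:

```python
import numpy as np

# votes[t, i] = h_t(x_i) in {-1, +1}; alphas[t] = weight of classifier t (made-up values)
alphas = np.array([0.8, 0.5, 0.3])
votes = np.array([[+1, +1, -1, +1],
                  [+1, -1, -1, +1],
                  [-1, +1, -1, -1]])
y = np.array([+1, +1, -1, -1])

# margin(x, y) = y * sum_t alpha_t h_t(x) / sum_t alpha_t, always in [-1, 1]
margins = y * (alphas @ votes) / alphas.sum()
print(margins)   # large positive = confident and correct, negative = misclassified
```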

Page 62: Machine Learning

62

Generalization Error

In response to empirical findings Schapire et al. derived a new bound.

This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error

This bound suggests that higher margins are preferable for lower generalization error

Page 63: Machine Learning

63

Relation to Support Vector Machines

The boosting margins theory turns out to have a strong connection with support vector machines

Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them

The optimal weights would be the weights that maximize the minimum margin

Page 64: Machine Learning

64

Relation to Support Vector Machines

Both support vector machines and boosting can be seen as trying to optimize the same objective function

Both attempt to maximize the minimum margin. The difference is the norms used by Boosting and SVM.

Boosting norms:  ‖α‖₁ = Σ_t |α_t|,  ‖h(x)‖_∞ = max_t |h_t(x)|

SVM norms:  ‖α‖₂ = √(Σ_t α_t²),  ‖h(x)‖₂ = √(Σ_t h_t(x)²)

Page 65: Machine Learning

65

Relation to Support Vector Machines

Effects of different norms:

- Different norms can lead to very different results, especially in high dimensional spaces.

- Different computation requirements: SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming.

- Difference in finding linear classifiers for high dimensional spaces:

SVMs use kernels to perform low dimensional calculations which are equivalent to inner products in high dimensions.

Boosting employs greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with the sample labels.

Page 66: Machine Learning

66

AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami
Computer Science and Engineering Dept.

Page 67: Machine Learning

67

The Rules for Boosting

1. Set all weights of training examples equal

2. Train a weak learner on the weighted examples

3. Check how well the weak learner performs on data and give it a weight based on how well it did

4. Re-weight training examples and repeat

5. When done, predict by voting by majority
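The same five steps are available prepackaged; the snippet below uses scikit-learn, which is not referenced in the slides, so treat it as an illustrative stand-in rather than the MATLAB demo code referenced later in the deck:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 200 boosting rounds; the default weak learner is a depth-1 decision tree (a stump)
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```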

Page 68: Machine Learning

68

Overview of Adaboost

Taken from Bishop

Page 69: Machine Learning

69

Toy Example

Taken from Schapire

1. 5 positive examples
2. 5 negative examples
3. 2-dimensional plane
4. Weak hypotheses: linear separators
5. 3 iterations
6. All given equal weights

Page 70: Machine Learning

70

First classifier

Taken from Schapire

Misclassified examples are circled, given more weight

Page 71: Machine Learning

71

First 2 classifiers

Taken from Schapire

Misclassified examples are circled, given more weight

Page 72: Machine Learning

72

First 3 classifiers

Taken from Schapire

Misclassified examples are circled, given more weight

Page 73: Machine Learning

73

Final Classifier learned by Boosting

Taken from Schapire

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Page 74: Machine Learning

74

MATLAB CODE FOR AdaBoost

Page 75: Machine Learning

75

Cardiac Ultrasound Videos

Class Training Test Boosting cycles

A2C 20 9 200

A4C 20 10 200

Page 76: Machine Learning

76

Breast Cancer Boosting Results

Class Training Test Boosting cycles

Benign 100 100 200

Malignant 100 100 200

Page 77: Machine Learning

77

Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst

Page 78: Machine Learning

78

Machine Learning

Multiclass Classification for Boosting

Presented By: Chris Kuhn

Computer Science and Engineering Dept.

Page 79: Machine Learning

79

The Idea

Everything covered so far has been the binary two-class classification problem; what happens when dealing with more than two classes?

What changes in the problem?
- y ∈ {−1, +1} → y ∈ {1, 2, …, k}
- Random guess accuracy changes from 1/2 to 1/k

Weak learning classifiers need to be updated.

Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?

There are cases where this condition is satisfied but there is no way to drive training error to 0, making boosting impossible.

THIS IS TOO WEAK!

Page 80: Machine Learning

80

AdaBoost.M1

Page 81: Machine Learning

81

AdaBoost.M1

Almost the same algorithm as regular AdaBoost.

Advantage:
- Works similarly to binary AdaBoost but on multiclass problems.

Disadvantage:
- Boosting is possible only if each weak hypothesis has error slightly better (i.e., lower) than 1/2.
- For k = 2, slightly better than 1/2 is just better than a random guess; what about k > 2? TOO STRONG! (unless the weak learner is strong)

Page 82: Machine Learning

82

An Alternative Approach

Can we create multiple binary problems out of a multiclass problem?

For each example x_i, ask: is the correct label y_i or some incorrect label y`? This gives K − 1 binary problems for each example.

h(x, y) = 1 if y is the label for x, 0 otherwise

h(x_i, y_i) = 0, h(x_i, y`) = 1 → y` is claimed correct (wrong)

h(x_i, y_i) = 1, h(x_i, y`) = 0 → y_i is claimed correct (right)

h(x_i, y_i) = h(x_i, y`) → uninformative
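A sketch of that reduction: each training example (x_i, y_i) is expanded into K − 1 binary questions of the form "is the label y_i rather than y`?". The function and variable names are illustrative and omit the AdaBoost.M2 weighting bookkeeping:

```python
def expand_to_binary(X, y, labels):
    """Turn a K-class problem into K-1 binary 'correct vs wrong label' questions per example.

    Returns triples (x_i, y_i, y_wrong); a weak hypothesis h(x, y) in {0, 1}
    is judged on whether it separates the correct label from the wrong one.
    """
    pairs = []
    for x_i, y_i in zip(X, y):
        for y_wrong in labels:
            if y_wrong != y_i:
                pairs.append((x_i, y_i, y_wrong))
    return pairs

# Example: 3 classes -> 2 binary questions per training example
X = [[0.1, 0.2], [0.5, 0.9]]
y = ["a", "c"]
for x_i, y_i, y_wrong in expand_to_binary(X, y, labels=["a", "b", "c"]):
    print(x_i, "correct:", y_i, "vs wrong:", y_wrong)
```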


Page 84: Machine Learning

84

AdaBoost.M2

Page 85: Machine Learning

85

AdaBoost.MR

- Generalized to allow multiple labels per example
- Different initial distribution
- h_t: X × Y → real number
- h_t is used to rank labels for a given example
- Now have ranking loss instead of error rate

Page 86: Machine Learning

86

AdaBoost.MR

Page 87: Machine Learning

87

Additional Algorithms

AdaBoost.MH
- One-against-all
- Requires strong weak-learning conditions

AdaBoost.MO
- Runs MH as part of the algorithm and uses the strong classifier to generate alternative strong classifiers, which can perform an extra voting step
- Still requires a strong weak-learning condition

SAMME
- Allows weak learners slightly better than random; cost matrix instead of weights and a different equation for weak classifier combination
- Conditions can be too weak for strong margins

Page 88: Machine Learning

88

Take Home

Yes, it is possible. There are many multiclass boosting algorithms available.

No, there is no 'one size fits all' multiclass algorithm.