Machine Learning


Transcript of Machine Learning

Page 1: Machine Learning

1

Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen
Computer Science and Engineering Dept.

Page 2: Machine Learning

2

Boosting

Method for converting rules of thumb into a prediction rule.

Rule of thumb? Method?

Page 3: Machine Learning

3

Binary Classification

X: set of all possible instances or examples.
- e.g., possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

c: X → {0,1}: the target concept to learn.
- e.g., c: EnjoySport → {0,1}

H: set of concept hypotheses.
- e.g., conjunctions of literals: <?, Cold, High, ?, ?, ?>

C: concept class, a set of target concepts c.

D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.

Page 4: Machine Learning

4

Binary Classification

S: training sample <x1, c(x1)>, …, <xm, c(xm)>

The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S

Page 5: Machine Learning

5

Errors

True error or generalization error of h with respect to the target concept c and distribution D:

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

Empirical error: average error of h on the training sample S drawn according to distribution D:

R̂[h] = (1/m) Σ_{i=1}^{m} 1[h(x_i) ≠ c(x_i)] = Pr_{x∼S}[h(x) ≠ c(x)]

Page 6: Machine Learning

6

Errors

Questions:
- Can we bound the true error of a hypothesis given only its training error?
- How many examples are needed for a good approximation?

Page 7: Machine Learning

7

Approximate Concept Learning

Requiring a learner to acquire the right concept is too strict

Instead, we will allow the learner to produce a good approximation to the actual concept

Page 8: Machine Learning

8

General Assumptions

Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)

Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).

Goal: h should have a low error rate on new examples drawn from the same distribution D.

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

Page 9: Machine Learning

9

PAC learning model

PAC learning: Probably Approximately Correct learning

The goal of the learning algorithm: to do optimization over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want the error of h, R[h], to be small.

If R[h] is small, h is "probably approximately correct". Formally, h is PAC if

Pr[R[h] ≤ ε] ≥ 1 − δ for all c ∈ C, ε > 0, δ > 0, and all distributions D

Page 10: Machine Learning

10

PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H, for all c ∈ C, ε > 0, δ > 0, and all distributions D, such that:

Pr[R[h] ≤ ε] ≥ 1 − δ, while using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time. ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner.

Page 11: Machine Learning

11

PAC learning model

Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

Pr[R[h] ≤ 1/2 − γ] ≥ 1 − δ for all c ∈ C, γ > 0, δ > 0, and all distributions D

A weak learner only outputs a hypothesis that performs slightly better than random guessing.

Hypothesis boosting problem: can we "boost" a weak learner into a strong learner?

Rule of thumb ~ weak learner; Method ~ boosting

Page 12: Machine Learning

12

Boosting a weak learner – Majority Vote

L learns h1 on the first N training points.

L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2.

L builds a third training set of N points for which h1 and h2 disagree, and produces h3.

L outputs h = Majority Vote(h1, h2, h3)
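A minimal sketch of this three-classifier scheme in Python, assuming a generic `weak_learn(X, y)` routine that returns an object with a `.predict` method and labels in {-1, +1}; the filtering step is simplified to boolean masks and is illustrative rather than Schapire's exact procedure:

```python
import numpy as np

def boost_by_majority(X, y, weak_learn, n_first):
    """Sketch of the three-hypothesis majority-vote boosting scheme."""
    # h1: train on the first N points
    h1 = weak_learn(X[:n_first], y[:n_first])

    # h2: train on a filtered set where h1 is right half the time and wrong half the time
    rest_X, rest_y = X[n_first:], y[n_first:]
    correct = h1.predict(rest_X) == rest_y
    half = min(correct.sum(), (~correct).sum())
    idx = np.concatenate([np.where(correct)[0][:half], np.where(~correct)[0][:half]])
    h2 = weak_learn(rest_X[idx], rest_y[idx])

    # h3: train on points where h1 and h2 disagree
    disagree = h1.predict(X) != h2.predict(X)
    h3 = weak_learn(X[disagree], y[disagree])

    def predict(X_new):
        votes = np.stack([h.predict(X_new) for h in (h1, h2, h3)])
        return np.sign(votes.sum(axis=0))   # majority vote of the three hypotheses

    return predict
```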

Page 13: Machine Learning

13

Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. A formal description of boosting:

- Given training set (x_1, y_1), …, (x_m, y_m)
- y_i ∈ {−1, +1}: correct label of x_i ∈ X
- for t = 1, …, T:
  - construct distribution D_t on {1, …, m}
  - find weak hypothesis h_t: X → {−1, +1} with small error ε_t on D_t
- output final hypothesis H_final

Page 14: Machine Learning

14

Boosting

[Diagram: the training sample is reweighted after each round; each weighted sample yields a weak hypothesis h_1(x), h_2(x), …, h_T(x); the final hypothesis is H(x) = sign(Σ_t α_t h_t(x))]

Page 15: Machine Learning

15

Boosting algorithms

- AdaBoost (Adaptive Boosting)
- LPBoost (Linear Programming Boosting)
- BrownBoost
- MadaBoost (modifies the weighting system of AdaBoost)
- LogitBoost

Page 16: Machine Learning

16

Lecture

- Motivating example
- AdaBoost
- Training error
- Overfitting
- Generalization error
- Examples of AdaBoost
- Multiclass for weak learners

Page 17: Machine Learning

17

Thank you!

Page 18: Machine Learning

18

Machine Learning

Proof of Bound on AdaBoost Training Error
Aaron Palmer

Page 19: Machine Learning

19

Theorem: 2 Class Error Bounds

Assume γ_t = 1/2 − ε_t

ε_t = error rate on round t of boosting; γ_t = how much better than random guessing

γ_t: a small, positive number

Training error of H_final is bounded by:

training error(H_final) ≤ ∏_t 2√(ε_t(1 − ε_t)) = ∏_t √(1 − 4γ_t²) ≤ exp(−2 Σ_t γ_t²)

Page 20: Machine Learning

20

Implications?

T = number of rounds of boosting; γ_t and T do not need to be known in advance

As long as γ_t ≥ γ > 0 on every round, the training error will decrease exponentially as a function of T

Page 21: Machine Learning

21

Proof Part I: Unwrap Distribution

Let f(x) = Σ_t α_t h_t(x), so that H_final(x) = sign(f(x)). Unwrapping the distribution update:

D_{T+1}(i) = D_T(i) exp(−α_T y_i h_T(x_i)) / Z_T
= D_{T−1}(i) exp(−α_{T−1} y_i h_{T−1}(x_i)) exp(−α_T y_i h_T(x_i)) / (Z_{T−1} Z_T)
= (1/m) exp(−y_i Σ_t α_t h_t(x_i)) / ∏_t Z_t

= (1/m) exp(−y_i f(x_i)) / ∏_t Z_t

Page 22: Machine Learning

22

Proof Part II: training error

Training error(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)] = (1/m) Σ_i 1[y_i f(x_i) ≤ 0]

≤ (1/m) Σ_i exp(−y_i f(x_i)) = Σ_i D_{T+1}(i) ∏_t Z_t = ∏_t Z_t

Page 23: Machine Learning

23

Proof Part III:

To make the bound ∏_t Z_t as small as possible, minimize each Z_t over α_t:

Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)) = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}

Set dZ_t/dα_t equal to zero and solve for α_t:  α_t = (1/2) ln((1 − ε_t)/ε_t)

Plug back into Z_t:  Z_t = 2√(ε_t(1 − ε_t))

Page 24: Machine Learning

24

Part III: Continued

Plug in the definition ε_t = 1/2 − γ_t:

Z_t = 2√(ε_t(1 − ε_t)) = 2√((1/2 − γ_t)(1/2 + γ_t)) = √(1 − 4γ_t²)

Page 25: Machine Learning

25

Exponential Bound

Use the property 1 + x ≤ e^x for all x.

Take x to be −4γ_t²:

∏_t √(1 − 4γ_t²) ≤ ∏_t (e^{−4γ_t²})^{1/2} = exp(−2 Σ_t γ_t²)

Page 26: Machine Learning

26

Putting it together

We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined

Bound is pretty loose

Page 27: Machine Learning

27

Example:

Suppose that all γ_t are at least 10%, so that no h_t has an error rate above 40%.

What upper bound does the theorem place on the training error?

Answer: training error(H_final) ≤ ∏_t 2√(ε_t(1 − ε_t)) ≤ (2√(0.4 × 0.6))^T ≈ (0.98)^T ≤ e^{−0.02T}
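A quick numerical check of that answer (plain arithmetic, with γ_t fixed at 0.1 for every round):

```python
import math

# Per-round factor when eps_t <= 0.4: 2*sqrt(eps_t*(1-eps_t)) <= 2*sqrt(0.24)
per_round = 2 * math.sqrt(0.4 * 0.6)   # ~0.9798
loose = math.exp(-2 * 0.1 ** 2)        # e^{-0.02} ~0.9802, the exponential form of the bound

for T in (10, 50, 100, 500):
    print(T, per_round ** T, loose ** T)   # both bounds shrink exponentially in T
```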

Page 28: Machine Learning

28

Overfitting?

Does the proof say anything about overfitting?

While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?

Page 29: Machine Learning

Boosting
Ayman Alharbi

Page 30: Machine Learning

Example (Spam emails)

Problem: filter out spam (junk email)

- Gather large collection of examples of spam and non-spam

From: [email protected] “can you review a paper” ... non-spam

From: [email protected] “Win 10000$ easily !!” ... spam

Page 31: Machine Learning

Example (Spam emails)

Main Observation - Easy to find “rules of thumb” that are “often” correct

- Hard to find single rule that is very highly accurate

If ‘buy now’ occurs in message, then predict ‘spam’

Page 32: Machine Learning

Example (Phone Cards)

Goal: automatically categorize type of call requested by phone customer

(Collect, CallingCard, PersonToPerson, etc.)- Yes I’d like to place a collect call long distance please

(Collect)

- operator I need to make a call but I need to bill it to my office

(ThirdNumber)

- I just called the wrong number and I would like to have that taken off of my bill

(BillingCredit)

Main Observation: Easy to find "rules of thumb" that are "often" correct

If ‘bill’ occurs in utterance, then predict ‘BillingCredit’

Hard to find single rule that is very highly accurate

Page 33: Machine Learning

The Boosting Approach
- Devise a computer program for deriving rough rules of thumb
- Apply procedure to a subset of emails
- Obtain rule of thumb
- Apply to 2nd subset of emails
- Obtain 2nd rule of thumb
- Repeat T times

Page 34: Machine Learning

Details: How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

How to combine rules of thumb into single prediction rule?

- Take (weighted) majority vote of rules of thumb !!

Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting.

Page 35: Machine Learning

Idea

At each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a hypothesis h_t
- Choose a strength for this hypothesis, α_t

Final classifier: weighted combination of weak learners

Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let the learned classifiers vote

Page 36: Machine Learning

36

Boosting

AdaBoost Algorithm
Presenter: Karl Severin

Computer Science and Engineering Dept.

Page 37: Machine Learning

37

Boosting Overview

Goal: Form one strong classifier from multiple weak classifiers.

Proceeds in rounds, iteratively producing classifiers from the weak learner.

Increases the weight given to incorrectly classified examples.

Gives each classifier an importance that is inversely proportional to its weighted error.

Each classifier gets a vote based on its importance.

Page 38: Machine Learning

38

Initialize

Initialize with an evenly weighted distribution: D_1(i) = 1/m for i = 1, …, m

Begin generating classifiers

Page 39: Machine Learning

39

Error

Quality of a classifier is based on its weighted error:

ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i] = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)

This is the probability that h_t will misclassify an example selected according to distribution D_t, or equivalently the summation of the weights of all misclassified examples.

Page 40: Machine Learning

40

Classifier Importance

α_t measures the importance given to classifier h_t:

α_t = (1/2) ln((1 − ε_t)/ε_t)

α_t > 0 if ε_t < 1/2 (ε_t is assumed to always be < 1/2)

α_t is inversely proportional to ε_t: the smaller the weighted error, the larger the importance
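A tiny numeric illustration of how α_t grows as the weighted error shrinks (the error rates below are arbitrary sample values):

```python
import math

for eps in (0.45, 0.30, 0.10, 0.01):
    alpha = 0.5 * math.log((1 - eps) / eps)   # classifier importance
    print(f"eps_t = {eps:.2f}  ->  alpha_t = {alpha:.3f}")
# eps_t near 1/2 gives alpha_t near 0; small eps_t gives a large vote
```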

Page 41: Machine Learning

41

Update Distribution

Update: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor.

This increases the weight of misclassified examples and decreases the weight of correctly classified examples.

Page 42: Machine Learning

42

Combine Classifiers

When classifying a new instance x, all of the weak classifiers get a vote weighted by their α:

H(x) = sign(Σ_t α_t h_t(x))
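Putting the steps together, here is a from-scratch sketch of binary AdaBoost with decision stumps as the weak learner; labels are assumed to be in {-1, +1}, and the stump search and function names are illustrative rather than taken from the slides:

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: exhaustive search for the best decision stump under weights D."""
    m, n = X.shape
    best = None
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, T=50):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)             # 1. start with a uniform distribution
    stumps, alphas = [], []
    for _ in range(T):
        stump = train_stump(X, y, D)    # 2. fit a weak classifier to the weighted sample
        eps = max(stump[0], 1e-10)      #    weighted error eps_t (guard against 0)
        alpha = 0.5 * np.log((1 - eps) / eps)   # 3. importance alpha_t
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)  # 4. up-weight mistakes, down-weight correct points
        D /= D.sum()                    #    renormalize (divide by Z_t)
        stumps.append(stump)
        alphas.append(alpha)

    def H(X_new):                       # 5. final hypothesis: sign of the weighted vote
        X_new = np.asarray(X_new, dtype=float)
        agg = sum(a * stump_predict(s, X_new) for a, s in zip(alphas, stumps))
        return np.sign(agg)

    return H
```

A trained ensemble is then used as `H = adaboost(X, y, T=50)` followed by `y_pred = H(X_test)`.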

Page 43: Machine Learning

43

Review

Page 44: Machine Learning

44

Questions?

?

Page 45: Machine Learning

45

Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

Page 46: Machine Learning

46

Generalization Error

Generalization error is the true error of a classifier

Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error

For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound

Page 47: Machine Learning

47

Generalization Error First Bound

With high probability, the generalization error is bounded as

R(H_final) ≤ R̂(H_final) + Õ(√(Td/m))

where R̂ = empirical risk (training error), T = boosting rounds, d = VC dimension of the base classifiers, m = number of training examples, R = generalization error.

Page 48: Machine Learning

48

Intuition of Bound: Hoeffding’s inequality

Define H to be a finite set of hypotheses which map examples to 0 or 1.

Hoeffding's inequality: Let X_1, …, X_m be independent random variables such that X_i ∈ [0, 1]. Denote their average value by X̄ = (1/m) Σ_i X_i. Then for any ε > 0 we have:

Pr[X̄ − E[X̄] ≥ ε] ≤ exp(−2mε²)

In the context of machine learning, think of the X_i as errors given by a hypothesis h: X_i = 1[h(x_i) ≠ y_i], where y_i is the true label for x_i.

Page 49: Machine Learning

49

Intuition of Bound: Hoeffding’s inequality

So X̄ is the empirical error R̂(h) and E[X̄] is the generalization error R(h),

and by Hoeffding's inequality:

Pr[R(h) ≥ R̂(h) + ε] ≤ exp(−2mε²)

Page 50: Machine Learning

50

Intuition of Bound

If we want to bound the generalization error of a single hypothesis with probability 1 − δ, where δ = exp(−2mε²), then we can solve for ε using δ: ε = √(ln(1/δ)/(2m)).

So R(h) ≤ R̂(h) + √(ln(1/δ)/(2m)) will hold with probability 1 − δ.

Page 51: Machine Learning

51

Intuition of Bound: Bounding All Hypotheses in the Set

So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis.

How do we bound the difference for all hypotheses in H?

Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0:

Pr[∃ h ∈ H: R(h) ≥ R̂(h) + ε] ≤ |H| exp(−2mε²)

Again by setting δ = |H| exp(−2mε²) and solving for ε we have

R(h) ≤ R̂(h) + √((ln|H| + ln(1/δ))/(2m))

which will hold with probability 1 − δ for all h ∈ H.
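As a quick sanity check on this finite-|H| bound, one can solve it for the deviation ε or for the required sample size m; the numbers below are arbitrary illustrations:

```python
import math

def deviation(H_size, m, delta):
    """eps such that R(h) <= R_hat(h) + eps for all h in H, with probability 1 - delta."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

def samples_needed(H_size, eps, delta):
    """Smallest m giving deviation at most eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

print(deviation(H_size=10**6, m=10_000, delta=0.05))      # ~0.029
print(samples_needed(H_size=10**6, eps=0.01, delta=0.05)) # ~84,056
```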

Page 52: Machine Learning

52

Intuition of Bound: Bounding Hypotheses in an Infinite Set

What about cases when H is infinite? Even if H is infinite, given a set of m examples the hypotheses in H may only be capable of labeling the examples in a number of ways < 2^m.

This implies that though H is infinite, the hypotheses are divided into classes which produce the same labelings. So the effective number of hypotheses is equal to the number of classes.

By using arguments similar to those above, |H| in the Hoeffding-based bound can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants.

Page 53: Machine Learning

53

Intuition of Bound: Bounding Hypotheses in an Infinite Set

More formally, let S = {x_1, …, x_m} be a finite set of examples and define the set of dichotomies Π_H(S) to be all possible labelings of S by hypotheses of H.

Also define the growth function to be the function

Π_H(m) = max_{|S| = m} |Π_H(S)|

which measures the maximum number of dichotomies for any sample of size m.

Page 54: Machine Learning

54

Intuition of Bound: Bounding Hypotheses in an Infinite Set

Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0, the probability that some h ∈ H has R(h) ≥ R̂(h) + ε can be bounded in terms of the growth function Π_H(m),

and with probability at least 1 − δ,

for all h ∈ H:  R(h) ≤ R̂(h) + O(√((ln Π_H(m) + ln(1/δ))/m))

Page 55: Machine Learning

55

Intuition of Bound: Bounding Hypotheses in an Infinite Set

It turns out that the growth function is either polynomial in m or equal to 2^m.

In the cases where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H.

In the case when the growth function is 2^m, the VC dimension is infinite.

VC dimension: the maximum number of points which can be shattered by H. Points are said to be shattered if the hypotheses in H can realize all possible labelings of the points.

Page 56: Machine Learning

56

Intuition of Bound: Bounding Hypotheses in an Infinite Set

The VC dimension turns out to be a very natural measure for the complexity of H and can be used to bound the growth function given m examples.

Sauer's lemma: If H is a hypothesis class of VC dimension d, then for all m:

Π_H(m) ≤ Σ_{i=0}^{d} C(m, i), and when m ≥ d, Π_H(m) ≤ (em/d)^d
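A small numerical illustration of Sauer's lemma, comparing the dichotomy count Σ_{i≤d} C(m, i) with the simpler (em/d)^d bound and with the unrestricted 2^m (m and d are arbitrary choices):

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 5
print(sauer_bound(m, d))        # ~7.9e7, polynomial in m
print((math.e * m / d) ** d)    # (em/d)^d ~4.8e8, the looser closed-form bound
print(2 ** m)                   # 2^m ~1.3e30, the unrestricted number of labelings
```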

Page 57: Machine Learning

57

Intuition of Bound: Bounding Hypotheses in an Infinite Set

Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then for any ε > 0, the probability that some h ∈ H has R(h) ≥ R̂(h) + ε can be bounded by combining the previous theorem with Sauer's lemma.

Thus with probability at least 1 − δ, for all h ∈ H:

R(h) ≤ R̂(h) + O(√((d ln(m/d) + ln(1/δ))/m))

Page 58: Machine Learning

58

Intuition of Bound: AdaBoost Generalization Error

The first bound for the AdaBoost generalization error follows from Sauer's lemma in a similar way

The growth function for Adaboost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations

Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error

Page 59: Machine Learning

59

Generalization Error First Bound

Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses. The log factors and constants are ignored.

This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.

Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.

Page 60: Machine Learning

60

Example

The graph shows boosting used on C4.5 to identify images of hand written characters. This experiment was carried out by Robert Schapire et al.

Even as the training error goes to zero and AdaBoost has been run for 1000 rounds the test error continues to decrease

[Graph: training and test error vs. number of boosting rounds, with the C4.5 test error shown for comparison]

Page 61: Machine Learning

61

Margin

AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made

This notion of confidence is quantified by the margin:

margin(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t

The margin takes values between −1 and 1

The magnitude of the margin can be viewed as a measure of confidence
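A sketch of how the normalized margin would be computed from the weak classifiers' votes; the α values and vote matrix below are made-up placeholders:

```python
import numpy as np

# votes[t, i] = h_t(x_i) in {-1, +1}; alphas[t] = weight of classifier t (made-up values)
alphas = np.array([0.8, 0.5, 0.3])
votes = np.array([[+1, +1, -1, +1],
                  [+1, -1, -1, +1],
                  [-1, +1, -1, -1]])
y = np.array([+1, +1, -1, -1])

# margin(x, y) = y * sum_t alpha_t h_t(x) / sum_t alpha_t, always in [-1, 1]
margins = y * (alphas @ votes) / alphas.sum()
print(margins)   # large positive = confident and correct, negative = misclassified
```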

Page 62: Machine Learning

62

Generalization Error

In response to empirical findings Schapire et al. derived a new bound.

This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error

This bound suggests that higher margins are preferable for lower generalization error

Page 63: Machine Learning

63

Relation to Support Vector Machines

The boosting margins theory turns out to have a strong connection with support vector machines

Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them

The optimal weights would be the weights that maximize the minimum margin

Page 64: Machine Learning

64

Relation to Support Vector Machines

Both support vector machines and boosting can be seen as trying to optimize the same objective function

Both attempt to maximize the minimum margin. The difference is the norms used by Boosting and SVM.

Boosting norms:  ‖α‖₁ = Σ_t |α_t|,  ‖h(x)‖_∞ = max_t |h_t(x)|

SVM norms:  ‖α‖₂ = √(Σ_t α_t²),  ‖h(x)‖₂ = √(Σ_t h_t(x)²)

Page 65: Machine Learning

65

Relation to Support Vector Machines

Effects of different norms:

- Different norms can lead to very different results, especially in high dimensional spaces.

- Different computation requirements: SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming.

- Difference in finding linear classifiers for high dimensional spaces:

SVMs use kernels to perform low dimensional calculations which are equivalent to inner products in high dimensions.

Boosting employs greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with the sample labels.

Page 66: Machine Learning

66

AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami
Computer Science and Engineering Dept.

Page 67: Machine Learning

67

The Rules for Boosting

1. Set all weights of training examples equal

2. Train a weak learner on the weighted examples

3. Check how well the weak learner performs on data and give it a weight based on how well it did

4. Re-weight training examples and repeat

5. When done, predict by voting by majority
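The same five steps are available prepackaged; the snippet below uses scikit-learn, which is not referenced in the slides, so treat it as an illustrative stand-in rather than the MATLAB demo code referenced later in the deck:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 200 boosting rounds; the default weak learner is a depth-1 decision tree (a stump)
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```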

Page 68: Machine Learning

68

Overview of Adaboost

Taken from Bishop

Page 69: Machine Learning

69

Toy Example

Taken from Schapire

1. 5 positive examples
2. 5 negative examples
3. 2-dimensional plane
4. Weak hypotheses: linear separators
5. 3 iterations
6. All given equal weights

Page 70: Machine Learning

70

First classifier

Taken from Schapire

Misclassified examples are circled, given more weight

Page 71: Machine Learning

71

First 2 classifiers

Taken from Schapire

Misclassified examples are circled, given more weight

Page 72: Machine Learning

72

First 3 classifiers

Taken from Schapire

Misclassified examples are circled, given more weight

Page 73: Machine Learning

73

Final Classifier learned by Boosting

Taken from Schapire

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Page 74: Machine Learning

74

MATLAB CODE FOR AdaBoost

Page 75: Machine Learning

75

Cardiac Ultrasound Videos

Class Training Test Boosting cycles

A2C 20 9 200

A4C 20 10 200

Page 76: Machine Learning

76

Breast Cancer Boosting Results

Class Training Test Boosting cycles

Benign 100 100 200

Malignant 100 100 200

Page 77: Machine Learning

77

Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst

Page 78: Machine Learning

78

Machine Learning

Multiclass Classification for Boosting

Presented By: Chris Kuhn

Computer Science and Engineering Dept.

Page 79: Machine Learning

79

The Idea

Everything covered so far has been the binary two-class classification problem; what happens when dealing with more than two classes?

What changes in the problem?
- y ∈ {−1, +1} → y ∈ {1, 2, …, k}
- Random guess accuracy changes from 1/2 to 1/k

Weak learning classifiers need to be updated.

Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?

There are cases where this condition is satisfied but there is no way to drive training error to 0, making boosting impossible.

THIS IS TOO WEAK!

Page 80: Machine Learning

80

AdaBoost.M1

Page 81: Machine Learning

81

AdaBoost.M1

Almost the same algorithm as regular AdaBoost.

Advantage:
- Works similarly to binary AdaBoost but on multiclass problems.

Disadvantage:
- Boosting is possible only if each weak hypothesis has error slightly better (i.e., lower) than 1/2.
- For k = 2, slightly better than 1/2 is just better than a random guess; what about k > 2? TOO STRONG! (unless the weak learner is strong)

Page 82: Machine Learning

82

An Alternative Approach

Can we create multiple binary problems out of a multiclass problem?

For each example x_i, ask: is the correct label y_i or some incorrect label y`? This gives K − 1 binary problems for each example.

h(x, y) = 1 if y is the label for x, 0 otherwise

h(x_i, y_i) = 0, h(x_i, y`) = 1 → y` is claimed correct (wrong)

h(x_i, y_i) = 1, h(x_i, y`) = 0 → y_i is claimed correct (right)

h(x_i, y_i) = h(x_i, y`) → uninformative
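A sketch of that reduction: each training example (x_i, y_i) is expanded into K − 1 binary questions of the form "is the label y_i rather than y`?". The function and variable names are illustrative and omit the AdaBoost.M2 weighting bookkeeping:

```python
def expand_to_binary(X, y, labels):
    """Turn a K-class problem into K-1 binary 'correct vs wrong label' questions per example.

    Returns triples (x_i, y_i, y_wrong); a weak hypothesis h(x, y) in {0, 1}
    is judged on whether it separates the correct label from the wrong one.
    """
    pairs = []
    for x_i, y_i in zip(X, y):
        for y_wrong in labels:
            if y_wrong != y_i:
                pairs.append((x_i, y_i, y_wrong))
    return pairs

# Example: 3 classes -> 2 binary questions per training example
X = [[0.1, 0.2], [0.5, 0.9]]
y = ["a", "c"]
for x_i, y_i, y_wrong in expand_to_binary(X, y, labels=["a", "b", "c"]):
    print(x_i, "correct:", y_i, "vs wrong:", y_wrong)
```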


Page 84: Machine Learning

84

AdaBoost.M2

Page 85: Machine Learning

85

AdaBoost.MR

- Generalized to allow multiple labels per example
- Different initial distribution
- h_t: X × Y → real number
- h_t is used to rank labels for a given example
- Now have ranking loss instead of error rate

Page 86: Machine Learning

86

AdaBoost.MR

Page 87: Machine Learning

87

Additional Algorithms

AdaBoost.MH
- One-against-all
- Requires strong weak-learning conditions

AdaBoost.MO
- Runs MH as part of the algorithm and uses the strong classifier to generate alternative strong classifiers, which can perform an extra voting step
- Still requires a strong weak-learning condition

SAMME
- Allows weak learners slightly better than random; cost matrix instead of weights and a different equation for weak classifier combination
- Conditions can be too weak for strong margins

Page 88: Machine Learning

88

Take Home

Yes, it is possible. There are many multiclass boosting algorithms available.

No, there is no 'one size fits all' multiclass algorithm.