Machine Learning
Transcript of Machine Learning
1
Machine Learning
CSE 5095: Special Topics Course
Boosting
Nhan Nguyen
Computer Science and Engineering Dept.
2
Boosting
Method for converting rules of thumb into a prediction rule.
Rule of thumb? Method?
3
Binary Classification
X: set of all possible instances or examples
- e.g., possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
c: X → {0,1}: the target concept to learn
- e.g., c: EnjoySport → {0,1}
H: set of concept hypotheses
- e.g., conjunctions of literals: <?, Cold, High, ?, ?, ?>
C: concept class, a set of target concepts c
D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.
4
Binary Classification
S: training sample $\langle x_1, c(x_1)\rangle, \ldots, \langle x_m, c(x_m)\rangle$
The learning algorithm receives sample S and selects a hypothesis from H approximating c.
- Find a hypothesis $h \in H$ such that $h(x) = c(x)$ for all $x \in S$
5
Errors
True error or generalization error of h with respect to the target concept c and distribution D:
$R(h) = \Pr_{x \sim D}[h(x) \neq c(x)]$
Empirical error: average error of h on the training sample S drawn according to distribution D:
$\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} 1[h(x_i) \neq c(x_i)]$
6
Errors
Questions:
– Can we bound the true error of a hypothesis given only its training error?
– How many examples are needed for a good approximation?
7
Approximate Concept Learning
Requiring a learner to acquire the right concept is too strict
Instead, we will allow the learner to produce a good approximation to the actual concept
8
General Assumptions
Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)
Learning Algorithm: from a set of m examples, outputs a hypothesis $h \in H$ that is consistent with those examples (i.e., correctly classifies all of them).
Goal: h should have a low error rate on new examples drawn from the same distribution D.
$R(h) = \Pr_{x \sim D}[h(x) \neq c(x)]$
9
PAC learning model
PAC learning: Probably Approximately Correct learning
The goal of the learning algorithm: optimize over S to find some hypothesis $h: X \to \{0,1\}$ in H that approximates c, so that the error $R(h)$ is small
If $R(h)$ is small, h is “probably approximately correct”. Formally, h is PAC if
$\Pr[R(h) \leq \epsilon] \geq 1 - \delta$ for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions D
10
PAC learning model
Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis $h \in H$ for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions D such that:
$\Pr[R(h) \leq \epsilon] \geq 1 - \delta$
using at most $\mathrm{poly}(1/\epsilon, 1/\delta, \mathrm{size}(X), \mathrm{size}(c))$ examples and running time. $\epsilon$: accuracy, $1 - \delta$: confidence.
Such an L is called a strong learner
11
PAC learning model
Learner L is a weak learner if L outputs a hypothesis $h \in H$ such that:
$\Pr[R(h) \leq \tfrac{1}{2} - \gamma] \geq 1 - \delta$ for all $c \in C$, $\gamma > 0$, $\delta > 0$, and all distributions D
A weak learner only outputs a hypothesis that performs slightly better than random guessing
Hypothesis boosting problem: can we “boost” a weak learner into a strong learner?
Rule of thumb ~ weak learner; Method ~ boosting
12
Boosting a weak learner – Majority Vote
L learns h1 on the first N training points.
L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2.
L builds a third training set of N points for which h1 and h2 disagree, and produces h3.
L outputs h = MajorityVote(h1, h2, h3)
13
Boosting [Schapire ’89]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. A formal description of boosting:
o Given training set $(x_1, y_1), \ldots, (x_m, y_m)$, where $y_i \in \{-1, +1\}$ is the correct label of $x_i$
o for t = 1, …, T:
o construct distribution $D_t$ on $\{1, \ldots, m\}$
o find weak hypothesis $h_t: X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$
o output final hypothesis $H_{\mathrm{final}}$
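The loop above can be sketched in code. Below is a minimal pure-Python version, assuming single-threshold decision stumps as the weak learner (all function and variable names are illustrative, not from the original slides):

```python
import math

def stump(x, thr, pol):
    # A decision stump on one feature: predicts pol if x >= thr, else -pol.
    return pol if x >= thr else -pol

def best_stump(xs, ys, D):
    # Exhaustively pick the stump with the smallest weighted error on D.
    best, best_err = None, float("inf")
    for thr in xs:
        for pol in (1, -1):
            err = sum(d for x, y, d in zip(xs, ys, D) if stump(x, thr, pol) != y)
            if err < best_err:
                best, best_err = (thr, pol), err
    return best, best_err

def adaboost(xs, ys, T):
    m = len(xs)
    D = [1.0 / m] * m                      # D_1: uniform over {1, ..., m}
    ensemble = []
    for _ in range(T):
        (thr, pol), eps = best_stump(xs, ys, D)    # weak hypothesis h_t
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)    # importance of h_t
        D = [d * math.exp(-alpha * y * stump(x, thr, pol))
             for x, y, d in zip(xs, ys, D)]        # up-weight mistakes
        Z = sum(D)
        D = [d / Z for d in D]                     # renormalize (divide by Z_t)
        ensemble.append((alpha, thr, pol))
    return ensemble

def H(ensemble, x):
    # Final hypothesis: sign of the weighted vote.
    s = sum(a * stump(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if s >= 0 else -1
```

Each round reweights the sample so the next stump concentrates on the examples the previous ones got wrong, which is exactly the distribution-construction step in the description above.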
14
Boosting
[Diagram: the training sample feeds a sequence of weighted samples; each weighted sample produces a weak hypothesis $h_1(x), h_2(x), \ldots, h_T(x)$.]
Final hypothesis: $H(x) = \mathrm{sign}\left[\sum_t \alpha_t h_t(x)\right]$
15
Boosting algorithms
AdaBoost (Adaptive Boosting)
LPBoost (Linear Programming Boosting)
BrownBoost
MadaBoost (modifying the weighting system of AdaBoost)
LogitBoost
16
Lecture
Motivating example; AdaBoost; Training Error; Overfitting; Generalization Error; Examples of AdaBoost; Multiclass for weak learner
17
Thank you!
18
Machine Learning
Proof of Bound on AdaBoost Training Error
Aaron Palmer
19
Theorem: 2-Class Error Bounds
Assume $\epsilon_t = \frac{1}{2} - \gamma_t$
$\epsilon_t$ = error rate on round t of boosting; $\gamma_t$ = how much better than random guessing
$\gamma_t$: a small, positive number
Training error of $H_{\mathrm{final}}$ is bounded by:
$\mathrm{err}(H_{\mathrm{final}}) \leq \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \leq \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$
20
Implications?
T = number of rounds of boosting; T and the $\gamma_t$ do not need to be known in advance
As long as $\gamma_t \geq \gamma > 0$ on every round, the training error will decrease exponentially as a function of T
21
Proof Part I: Unwrap Distribution
Let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\mathrm{final}}(x) = \mathrm{sign}(f(x))$. Unwrapping the recursive definition of the distribution:
$D_{T+1}(i) = D_T(i)\,\frac{e^{-\alpha_T y_i h_T(x_i)}}{Z_T} = \cdots = \frac{1}{m}\,\frac{e^{-y_i \sum_t \alpha_t h_t(x_i)}}{\prod_t Z_t}$
$= \frac{1}{m}\,\frac{e^{-y_i f(x_i)}}{\prod_t Z_t}$
22
Proof Part II: training error
Training error$(H_{\mathrm{final}}) = \frac{1}{m}\sum_i 1[y_i \neq H_{\mathrm{final}}(x_i)] \leq \frac{1}{m}\sum_i e^{-y_i f(x_i)}$
$= \sum_i D_{T+1}(i)\prod_t Z_t = \prod_t Z_t$
23
Proof Part III:
$Z_t = \sum_i D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$
Set $\frac{dZ_t}{d\alpha_t}$ equal to zero and solve for $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
Plug back into $Z_t$: $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$
24
Part III: Continued
Plug in the definition $\epsilon_t = \frac{1}{2} - \gamma_t$:
$Z_t = 2\sqrt{\big(\tfrac{1}{2}-\gamma_t\big)\big(\tfrac{1}{2}+\gamma_t\big)} = \sqrt{1-4\gamma_t^2}$
25
Exponential Bound
Use the property $1 + x \leq e^x$
Take x to be $-4\gamma_t^2$: then $1 - 4\gamma_t^2 \leq e^{-4\gamma_t^2}$, so $Z_t = \sqrt{1-4\gamma_t^2} \leq e^{-2\gamma_t^2}$ and $\prod_t Z_t \leq e^{-2\sum_t \gamma_t^2}$
26
Putting it together
We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined
Bound is pretty loose
27
Example:
Suppose that all $\gamma_t$ are at least 10%, so that no $h_t$ has an error rate above 40%
What upper bound does the theorem place on the training error?
Answer: $\mathrm{err}(H_{\mathrm{final}}) \leq \prod_{t=1}^{T}\sqrt{1-4(0.1)^2} = (0.96)^{T/2} \leq e^{-0.02T}$
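To put numbers on this, here is a quick computation of the bound $e^{-2\sum_t \gamma_t^2}$ when every round achieves $\gamma_t = 0.1$ (a small illustrative script, not part of the original slides):

```python
import math

def training_error_bound(T, gamma=0.1):
    # exp(-2 * sum_t gamma_t^2), with gamma_t = gamma on every round
    return math.exp(-2 * T * gamma ** 2)

for T in (10, 100, 500):
    print(T, training_error_bound(T))
# At T = 100 the bound is e^{-2}, roughly 0.135; by T = 500 it is below 1e-4.
```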
28
Overfitting?
Does the proof say anything about overfitting?
While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?
Boosting
Ayman Alharbi
Example (Spam emails)
* problem: filter out spam (junk email)
- Gather large collection of examples of spam and non-spam
From: [email protected] “can you review a paper” ... non-spam
From: [email protected] “Win 10000$ easily !!” ... spam
Example (Spam emails)
Main Observation - Easy to find “rules of thumb” that are “often” correct
- Hard to find single rule that is very highly accurate
If ‘buy now’ occurs in message, then predict ‘spam’
Example (Phone Cards)
Goal: automatically categorize type of call requested by phone customer
(Collect, CallingCard, PersonToPerson, etc.)- Yes I’d like to place a collect call long distance please
(Collect)
- operator I need to make a call but I need to bill it to my office
(ThirdNumber)
- I just called the wrong number and I would like to have that taken off of my bill
(BillingCredit)
Main Observation
- Easy to find “rules of thumb” that are “often” correct
If ‘bill’ occurs in utterance, then predict ‘BillingCredit’
- Hard to find single rule that is very highly accurate
The Boosting Approach
- Devise computer program for deriving rough rules of thumb
- Apply procedure to subset of emails
- Obtain rule of thumb
- Apply to 2nd subset of emails
- Obtain 2nd rule of thumb
- Repeat T times
Details How to choose examples on each round?
- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
How to combine rules of thumb into single prediction rule?
- Take (weighted) majority vote of rules of thumb !!
Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting
Idea
• At each iteration t:
– Weight each training example by how incorrectly it was classified
– Learn a hypothesis – ht
– Choose a strength for this hypothesis – αt
Final classifier: weighted combination of weak learners
Idea : given a set of weak learners, run them multiple times on (reweighted) training data, then let learned classifiers vote
36
Boosting
AdaBoost Algorithm
Presenter: Karl Severin
Computer Science and Engineering Dept.
37
Boosting Overview
Goal: Form one strong classifier from multiple weak classifiers.
Proceeds in rounds, iteratively producing classifiers from the weak learner.
Increases the weight given to incorrectly classified examples.
Gives each classifier an importance inversely related to its weighted error.
Each classifier gets a vote based on its importance.
38
Initialize
Initialize with evenly weighted distribution: $D_1(i) = \frac{1}{m}$ for i = 1, …, m
Begin generating classifiers
39
Error
Quality of classifier based on weighted error:
$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
the probability that $h_t$ will misclassify an example selected according to distribution $D_t$, or equivalently the summation of the weights of all misclassified examples
40
Classifier Importance
$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ measures the importance given to classifier $h_t$
$\alpha_t > 0$ if $\epsilon_t < \frac{1}{2}$ ($\epsilon_t$ assumed to always be $< \frac{1}{2}$)
$\alpha_t$ grows as $\epsilon_t$ shrinks: the importance is inversely related to the weighted error
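Plugging a few error rates into $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ makes this behavior concrete (the error rates below are illustrative values only):

```python
import math

def alpha(eps):
    # Importance assigned to a classifier with weighted error eps.
    return 0.5 * math.log((1 - eps) / eps)

for eps in (0.1, 0.25, 0.4, 0.49):
    print(eps, round(alpha(eps), 3))
# alpha is large for accurate classifiers, approaches 0 as eps -> 1/2,
# and would be negative for eps > 1/2.
```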
41
Update Distribution
$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\,e^{-\alpha_t y_i h_t(x_i)}$, where $Z_t$ is a normalization factor
Increase weight of misclassified examples
Decrease weight of correctly classified examples
42
Combine Classifiers
When classifying a new instance x, all of the weak classifiers get a vote weighted by their α: $H(x) = \mathrm{sign}\big(\sum_t \alpha_t h_t(x)\big)$
43
Review
44
Questions?
?
45
Machine Learning
CSE 5095: Special Topics Course
Instructor: Jinbo Bi
Computer Science and Engineering Dept.
Presenter: Brian McClanahan
Topic: Boosting Generalization Error
46
Generalization Error
Generalization error is the true error of a classifier
Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error
For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound
47
Generalization Error First Bound
$\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + \tilde{O}\Big(\sqrt{\frac{Td}{m}}\Big)$
$\widehat{\mathrm{err}}(h)$ – empirical risk (training error); T – boosting rounds; d – VC dimension of base classifiers; m – number of training examples; $\mathrm{err}(h)$ – generalization error
48
Intuition of Bound: Hoeffding’s inequality
Define H to be a finite set of hypotheses which map examples to 0 or 1
Hoeffding’s inequality: Let $X_1, \ldots, X_m$ be independent random variables such that $X_i \in [0, 1]$. Denote their average value $\bar{X} = \frac{1}{m}\sum_i X_i$. Then for any $\epsilon > 0$ we have:
$\Pr[E[\bar{X}] - \bar{X} \geq \epsilon] \leq e^{-2m\epsilon^2}$
In the context of machine learning, think of the $X_i$ as errors given by a hypothesis h: $X_i = 1[h(x_i) \neq y_i]$, where $y_i$ is the true label for $x_i$
49
Intuition of Bound: Hoeffding’s inequality
So $\bar{X} = \widehat{\mathrm{err}}(h)$ and $E[\bar{X}] = \mathrm{err}(h)$,
and by Hoeffding’s inequality:
$\Pr[\mathrm{err}(h) \geq \widehat{\mathrm{err}}(h) + \epsilon] \leq e^{-2m\epsilon^2}$
50
Intuition of Bound
If we want to bound the generalization error of a single hypothesis with probability $1 - \delta$, where $\delta = e^{-2m\epsilon^2}$, then we can solve for $\epsilon$: $\epsilon = \sqrt{\frac{\ln(1/\delta)}{2m}}$
So $\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}$ will hold with probability $1 - \delta$
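Solving $\delta = e^{-2m\epsilon^2}$ for m also answers the earlier question of how many examples are needed for a good approximation; a small helper (names are illustrative):

```python
import math

def examples_needed(epsilon, delta):
    # Smallest m such that sqrt(ln(1/delta) / (2m)) <= epsilon.
    return math.ceil(math.log(1 / delta) / (2 * epsilon ** 2))

print(examples_needed(0.05, 0.05))  # 600 examples for a single hypothesis
print(examples_needed(0.10, 0.05))  # 150: doubling epsilon divides m by four
```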
51
Intuition of Bound:Bounding all hypotheses in set
So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis
How do we bound the difference for all hypotheses in H?
Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$:
$\Pr[\exists h \in H: \mathrm{err}(h) \geq \widehat{\mathrm{err}}(h) + \epsilon] \leq |H|\,e^{-2m\epsilon^2}$
Again by setting $\delta = |H|\,e^{-2m\epsilon^2}$ and solving for $\epsilon$ we have that
$\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}$
will hold with probability $1 - \delta$ for all $h \in H$
52
Intuition of Bound:Bounding Hypotheses in Infinite Set
What about cases when H is infinite?
Even if H is infinite, given a set of m examples the hypotheses in H may only be capable of labeling the examples in a finite number of ways, at most $2^m$
This implies that though H is infinite, the hypotheses are divided into classes which produce the same labelings. So the effective number of hypotheses is equal to the number of classes
By using arguments similar to those above, |H| in Hoeffding’s inequality can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants
53
Intuition of Bound:Bounding Hypotheses in Infinite Set
More formally, let $S = \{x_1, \ldots, x_m\}$ be a finite set of examples and define the set of dichotomies $\Pi_H(S)$ to be all possible labelings of S by hypotheses of H
Also define the growth function to be the function
$\Pi_H(m) = \max_{|S| = m} |\Pi_H(S)|$
which measures the maximum number of dichotomies for any sample of size m
54
Intuition of Bound:Bounding Hypotheses in Infinite Set
Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$:
$\Pr[\exists h \in H: \mathrm{err}(h) \geq \widehat{\mathrm{err}}(h) + \epsilon] \leq O\big(\Pi_H(m)\,e^{-cm\epsilon^2}\big)$
and with probability at least $1 - \delta$, for all $h \in H$:
$\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + O\Big(\sqrt{\frac{\ln \Pi_H(m) + \ln(1/\delta)}{m}}\Big)$
55
Intuition of Bound:Bounding Hypotheses in Infinite Set
It turns out that the growth function is either polynomial in m or $2^m$
In the cases where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H
In the case when the growth function is $2^m$, the VC dimension is infinite
VC dimension – the maximum number of points which can be shattered by H. m points are said to be shattered if the hypotheses of H can realize all possible labelings of the points
56
Intuition of Bound:Bounding Hypotheses in Infinite Set
The VC dimension turns out to be a very natural measure for the complexity of H and can be used to bound the growth function given m examples
Sauer’s lemma: If H is a hypothesis class of VC dimension d, then for all $m \geq d$:
$\Pi_H(m) \leq \sum_{i=0}^{d}\binom{m}{i} \leq \Big(\frac{em}{d}\Big)^d$
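Sauer's lemma is easy to check numerically; the script below (illustrative, not from the slides) compares the exact binomial sum, the $(em/d)^d$ bound, and the $2^m$ total labelings:

```python
import math

def growth_bound(m, d):
    # Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i) for VC dimension d.
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 5
exact = growth_bound(m, d)
loose = (math.e * m / d) ** d
print(exact, loose, 2 ** m)
# Polynomial in m (degree d): vastly smaller than the 2^m possible labelings.
```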
57
Intuition of Bound:Bounding Hypotheses in Infinite Set
Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$:
$\Pr[\exists h \in H: \mathrm{err}(h) \geq \widehat{\mathrm{err}}(h) + \epsilon] \leq O\Big(\big(\tfrac{em}{d}\big)^d e^{-cm\epsilon^2}\Big)$
Thus with probability at least $1 - \delta$:
$\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + O\Big(\sqrt{\frac{d\ln(m/d) + \ln(1/\delta)}{m}}\Big)$
58
Intuition of Bound: Adaboost Generalization Error
The first bound for the AdaBoost generalization error follows from Sauer’s lemma in a similar way
The growth function for Adaboost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations
Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error
59
Generalization Error First Bound
Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses. The log factors and constants are ignored.
This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.
Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.
60
Example
The graph shows boosting used on C4.5 to identify images of handwritten characters. This experiment was carried out by Robert Schapire et al.
Even as the training error goes to zero and AdaBoost has been run for 1000 rounds, the test error continues to decrease
[Figure: training and test error versus number of boosting rounds; the Training and Test curves for boosted C4.5 are shown against the C4.5 test error.]
61
Margin
AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made
This notion of confidence is quantified by the margin: for an example (x, y), $\mathrm{margin}(x, y) = \frac{y\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}$
The margin takes values between 1 and -1
The magnitude of the margin can be viewed as a measure of confidence
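A small illustration of the normalized margin $y \sum_t \alpha_t h_t(x) / \sum_t \alpha_t$ for a toy three-classifier ensemble (the weights and votes below are made up for illustration):

```python
def margin(alphas, votes, y):
    # Normalized weighted vote in favor of the correct label; lies in [-1, 1].
    return y * sum(a * h for a, h in zip(alphas, votes)) / sum(abs(a) for a in alphas)

alphas = [1.0, 0.5, 0.25]
print(margin(alphas, [1, 1, 1], 1))    # unanimous correct vote: margin 1.0
print(margin(alphas, [1, -1, 1], 1))   # correct but less confident
print(margin(alphas, [-1, -1, 1], 1))  # negative margin: misclassified
```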
62
Generalization Error
In response to empirical findings Schapire et al. derived a new bound.
This new bound is defined in terms of the margins, VC dimension and sample size, but not the number of rounds or training error: with high probability, $\mathrm{err}(h) \leq \Pr_S[\mathrm{margin}(x, y) \leq \theta] + \tilde{O}\Big(\sqrt{\frac{d}{m\theta^2}}\Big)$ for any $\theta > 0$
This bound suggests that higher margins are preferable for lower generalization error
63
Relation to Support Vector Machines
The boosting margins theory turns out to have a strong connection with support vector machines
Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them
The optimal weights would be the weights that maximize the minimum margin
64
Relation to Support Vector Machines
Both support vector machines and boosting can be seen as trying to optimize the same objective function
Both attempt to maximize the minimum margin. The difference is the norms used by Boosting and SVM:
Boosting norms: $\|\alpha\|_1 = \sum_t |\alpha_t|$, $\|h(x)\|_\infty = \max_t |h_t(x)|$
SVM norms: $\|\alpha\|_2 = \sqrt{\sum_t \alpha_t^2}$, $\|h(x)\|_2 = \sqrt{\sum_t h_t(x)^2}$
65
Relation to Support Vector Machines
Effects of different norms
– Different norms can lead to very different results, especially in high dimensional spaces
– Different computation requirements
SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming
– Difference in finding linear classifiers for high dimensional spaces
SVMs use kernels to perform low dimensional calculations which are equivalent to inner products in high dimensions
Boosting employs a greedy search, using weight redistributions to select weak classifiers corresponding to coordinates highly correlated with the sample labels
66
AdaBoost Examples and Results
CSE 5095: Special Topics Course
Yousra Almathami
Computer Science and Engineering Dept.
67
The Rules for Boosting
1. Set all weights of training examples equal
2. Train a weak learner on the weighted examples
3. Check how well the weak learner performs on data and give it a weight based on how well it did
4. Re-weight training examples and repeat
5. When done, predict by voting by majority
68
Overview of Adaboost
Taken from Bishop
69
Toy Example
Taken from Schapire
1. 5 positive examples
2. 5 negative examples
3. 2-dimensional plane
4. Weak hypotheses: linear separators
5. 3 iterations
6. All given equal weights
70
First classifier
Taken from Schapire
Misclassified examples are circled, given more weight
71
First 2 classifiers
Taken from Schapire
Misclassified examples are circled, given more weight
72
First 3 classifiers
Taken from Schapire
Misclassified examples are circled, given more weight
73
Final Classifier learned by Boosting
Taken from Schapire
Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.
74
MATLAB CODE FOR AdaBoost
75
Cardiac Ultrasound Videos
Class Training Test Boosting cycles
A2C 20 9 200
A4C 20 10 200
76
Breast Cancer Boosting Results
Class Training Test Boosting cycles
Benign 100 100 200
Malignant 100 100 200
77
Boosting Demo
Online Demo taken from www.Mathworks.com by Richard Stapenhurst
78
Machine Learning
Multiclass Classification for Boosting
Presented By: Chris Kuhn
Computer Science and Engineering Dept.
79
The Idea
Everything covered so far has been the binary two-class classification problem; what happens when dealing with more than two classes?
What changes in the problem? y = {-1,+1} → y = {1, 2, …, k} Random guess value changes from ½ to 1/k
Weak learning classifiers need to be updated
Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?
There are cases where this condition is satisfied but there is no way to drive training error to 0 making boosting impossible
THIS IS TOO WEAK!
80
AdaBoost.M1
81
AdaBoost.M1
Almost the same algorithm as regular AdaBoost
Advantage: works like binary AdaBoost but on multiclass problems
Disadvantage: boosting is possible only if the weak hypotheses have error slightly better than (i.e., less than) ½
For k = 2, slightly better than ½ represents better than a random guess, but what about k > 2? TOO STRONG! (unless the weak learner is strong)
82
An Alternative Approach
Can we create multiple binary problems out of a multiclass problem?
For example xi: is the correct label yi or y`? → k – 1 binary problems for each example
h(x,y) = 1 if y is the label for x, 0 otherwise
h(xi, yi)=0, h(xi, y`)=1 → y` is correct (wrong)
h(xi, yi)=1, h(xi, y`)=0 → yi is correct (right)
h(xi, yi)=h(xi, y`) → uninformative
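The three cases above can be read off mechanically. Here is a tiny helper that interprets one binary question "is the label yi rather than y`?" (the function name and return strings are illustrative):

```python
def interpret(h_true, h_other):
    # h_true = h(xi, yi) and h_other = h(xi, y`), each 0 or 1.
    if h_true == 1 and h_other == 0:
        return "right"          # prefers the correct label yi
    if h_true == 0 and h_other == 1:
        return "wrong"          # prefers the incorrect label y`
    return "uninformative"      # no preference between the two labels

print(interpret(1, 0), interpret(0, 1), interpret(1, 1))
```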
83
84
AdaBoost.M2
85
AdaBoost.MR
Generalized to allow multiple labels per example
Different initial distribution
ht : X × Y → real number
ht used to rank labels for a given example
Now have ranking loss instead of error rate
86
AdaBoost.MR
87
Additional Algorithms
AdaBoost.MH
One-against-all; requires strong weak-learning conditions
AdaBoost.MO
Runs MH as part of the algorithm and uses the strong classifier to generate alternative strong classifiers, which can perform an extra voting step
Still requires a strong weak-learning condition
SAMME
Allows weak learners slightly better than random; cost matrix instead of weights and a different equation for weak classifier combination
Conditions can be too weak for strong margins
88
Take Home
Yes, it is possible: there are many multiclass boosting algorithms available
No, there is no ‘one size fits all’ multiclass algorithm