Machine Learning for Data Mining - Combining models
Machine Learning for Data Mining: Combining Models
Andres Mendez-Vazquez
July 19, 2015
Outline
1 Introduction
2 Bayesian Model Averaging
  Model Combination vs. Bayesian Model Averaging
3 Committees
  Bootstrap Data Sets
4 Boosting
  AdaBoost Development
    Cost Function
    Selection Process
    Selecting New Classifiers
    Using our Deriving Trick
  AdaBoost Algorithm
  Some Remarks and Implementation Issues
  Explanation about AdaBoost's behavior
  Example
Introduction

Observation
It is often found that improved performance can be obtained by combining multiple classifiers together in some way.

Example: Committees
We might train L different classifiers and then make predictions using the average of the predictions made by each classifier.

Example: Boosting
It involves training multiple models in sequence, in which the error function used to train a particular model depends on the performance of the previous models.
In addition

Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the models to make the prediction.

Where
The choice of model is a function of the input variables.

Thus
Different models become responsible for making decisions in different regions of the input space.
We want this
(Figure: models in charge of different sets of inputs.)
Example: Decision Trees

We can have the decision trees on top of the models
Given a set of models, a model is chosen to make the decision in a certain area of the input space.

Limitation: it is based on hard splits in which only one model is responsible for making predictions for any given input value.

Thus it is better to soften the combination
Suppose we have K classifiers for a conditional distribution p(t|x, k), where:
- x is the input variable.
- t is the target variable.
- k = 1, 2, ..., K indexes the classifiers.
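The hard-split selection above can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the split point (x < 0.5) and the two component models are hypothetical choices made only to show that exactly one model answers for each input region.

```python
# Hard split: exactly one model is responsible for any given input value.

def model_a(x):
    # Hypothetical model responsible for the region x < 0.5
    return 2.0 * x

def model_b(x):
    # Hypothetical model responsible for the region x >= 0.5
    return 1.0 - x

def predict(x):
    # The choice of model is a function of the input variable x.
    return model_a(x) if x < 0.5 else model_b(x)
```

Softening this hard 0/1 choice into input-dependent weights is exactly what the mixture-of-experts form below does.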
This is used in the mixture of distributions

Thus (Mixture of Experts)

p(t|x) = Σ_{k=1}^{K} π_k(x) p(t|x, k)    (1)

where π_k(x) = p(k|x) are the input-dependent mixing coefficients.

This type of model
It can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.
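Equation (1) can be sketched numerically. Everything below is an illustrative assumption, not part of the slides: a two-expert softmax gate standing in for π_k(x), and unit-variance Gaussian experts standing in for p(t|x, k).

```python
import math

def softmax(scores):
    # Turn arbitrary scores into mixing coefficients that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gate(x):
    # pi_k(x) = p(k|x): input-dependent mixing coefficients (toy gate).
    return softmax([x, -x])

def expert(k, t, x):
    # p(t|x, k): unit-variance Gaussian expert with a toy mean 2x or -2x.
    mu = 2.0 * x if k == 0 else -2.0 * x
    return math.exp(-0.5 * (t - mu) ** 2) / math.sqrt(2.0 * math.pi)

def p_t_given_x(t, x):
    # Equation (1): p(t|x) = sum_k pi_k(x) * p(t|x, k)
    return sum(pi * expert(k, t, x) for k, pi in enumerate(gate(x)))
```

The gate outputs valid mixing coefficients for every x, so p(t|x) is a proper density for every input.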
Now

Although
Model combination and Bayesian model averaging look similar.

However
They are actually different.

For this
We have the following example.
Example

For this, consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs:

p(x, z)    (2)

Thus
The corresponding density over the observed variable x is obtained by marginalization:

p(x) = Σ_z p(x, z)    (3)
In the case of Gaussian distributions
We have

p(x) = Σ_{k=1}^{K} π_k N(x|μ_k, Σ_k)    (4)

Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}

p(X) = Π_{n=1}^{N} p(x_n) = Π_{n=1}^{N} [Σ_{z_n} p(x_n, z_n)]    (5)

Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h).
For instance, one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
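The per-point latent variable z_n in Equation (5) can be made concrete by ancestral sampling: first draw z_n, then draw x_n from the chosen component. The mixture parameters below are toy assumptions for illustration only.

```python
import random

# Ancestral sampling from a two-component Gaussian mixture (Eq. 4):
# each data point x_n carries its OWN latent z_n -- the model-combination
# view, as opposed to a single h explaining the whole data set.
PI    = [0.3, 0.7]      # toy mixing coefficients pi_k
MU    = [-2.0, 3.0]     # toy means mu_k
SIGMA = [0.5, 1.0]      # toy standard deviations

def sample_point(rng):
    z = 0 if rng.random() < PI[0] else 1   # draw the latent component z_n
    return rng.gauss(MU[z], SIGMA[z]), z   # then draw x_n given z_n

rng = random.Random(0)
data = [sample_point(rng) for _ in range(5000)]
frac_component_0 = sum(1 for _, z in data if z == 0) / len(data)
```

With enough samples, the empirical fraction of points from component 0 approaches π_0 = 0.3, and both components are represented in the same data set.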
Bayesian Model Averaging
The marginal distribution is

p(X) = Σ_{h=1}^{H} p(X, h) = Σ_{h=1}^{H} p(X|h) p(h)    (6)

Observations
1. This is an example of Bayesian model averaging.
2. The summation over h means that just one model is responsible for generating the whole data set.
3. The probability over h simply reflects our uncertainty about which is the correct model to use.

Thus
As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
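The concentration of p(h|X) can be checked with a small simulation of Equation (6). The two candidate models here (a standard Gaussian and a standard Cauchy, with a uniform prior p(h) = 1/2) are illustrative assumptions; the computation is done in log space for numerical stability.

```python
import math
import random

def log_gauss(x):
    # log p(x | h = Gaussian), standard normal
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_cauchy(x):
    # log p(x | h = Cauchy), standard Cauchy
    return -math.log(math.pi * (1.0 + x * x))

def posterior(data):
    # p(h|X) proportional to p(h) * prod_n p(x_n|h); the uniform prior cancels.
    logs = [sum(log_gauss(x) for x in data),
            sum(log_cauchy(x) for x in data)]
    m = max(logs)                      # subtract the max to avoid underflow
    w = [math.exp(l - m) for l in logs]
    return [wi / sum(w) for wi in w]

rng = random.Random(1)
X = [rng.gauss(0.0, 1.0) for _ in range(500)]   # truly Gaussian data
p_small = posterior(X[:5])    # still uncertain with little data
p_large = posterior(X)        # focused on the Gaussian model
```

With 500 Gaussian points, the posterior mass on the Gaussian model is essentially 1, matching the "increasingly focused" behavior described above.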
The Differences

Bayesian model averaging
The whole data set is generated by a single model.

Model combination
Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components.
Committees

Idea
The simplest way to construct a committee is to average the predictions of a set of individual models.

Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance, which decomposes the error of the model into:
- The bias component, which arises from differences between the model and the true function to be predicted.
- The variance component, which represents the sensitivity of the model to the individual data points.
For example

We can do the following
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However

Big Problem
We normally have only a single data set.

Thus
We need to introduce some variability between the different committee members.

One approach
You can use bootstrap data sets.
What to do with the Bootstrap Data Set

Given a regression problem
Here, we are trying to predict the value of a single continuous variable.

What?
Suppose our original data set consists of N data points X = {x_1, ..., x_N}.

Thus
We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B.
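The resampling step can be sketched in a few lines; the data set below is just a stand-in.

```python
import random

def bootstrap(X, rng):
    # Draw N = len(X) points from X at random, WITH replacement.
    return [X[rng.randrange(len(X))] for _ in X]

rng = random.Random(42)
X = list(range(100))      # stand-in for the original data set
XB = bootstrap(X, rng)    # a bootstrap data set X_B
# Typically some points of X appear several times in X_B and others
# (about a fraction 1/e of them, for large N) do not appear at all.
```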
Furthermore

This is repeated M times
Thus, we generate M data sets, each of size N, each obtained by sampling from the original data set X.
Thus

Use each of them to train a copy y_m(x) of a predictive regression model
Then,

y_com(x) = (1/M) Σ_{m=1}^{M} y_m(x)    (7)

This is also known as Bootstrap Aggregation, or Bagging.
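Equation (7) can be sketched end to end. The "model" trained on each bootstrap sample is deliberately trivial here (it predicts the sample mean) — a hypothetical stand-in for a real regressor, used only to show the aggregation structure.

```python
import random

def bootstrap(X, rng):
    # Sample len(X) points from X with replacement.
    return [X[rng.randrange(len(X))] for _ in X]

def train(sample):
    # Toy regression model y_m(x): always predicts its sample's mean.
    mean = sum(sample) / len(sample)
    return lambda x: mean

def bagged(X, M, rng):
    # y_com(x) = (1/M) * sum_m y_m(x), each y_m trained on its own
    # bootstrap data set (Bootstrap Aggregation, "Bagging").
    models = [train(bootstrap(X, rng)) for _ in range(M)]
    return lambda x: sum(y_m(x) for y_m in models) / M

rng = random.Random(0)
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y_com = bagged(X, M=20, rng=rng)
```

In practice each y_m would be a full regression model; only the averaging in the last lambda is the point of the sketch.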
What do we do with these samples?

Now, assume a true regression function h(x) and an estimate y_m(x)

y_m(x) = h(x) + ε_m(x)    (8)

The average sum-of-squares error over the data takes the form

E_x[{y_m(x) − h(x)}²] = E_x[ε_m(x)²]    (9)

What is E_x?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning

Thus, the average error made by the models acting individually is

E_AV = (1/M) Σ_{m=1}^{M} E_x[ε_m(x)²]    (10)

Similarly, the expected error of the committee is

E_COM = E_x[{(1/M) Σ_{m=1}^{M} (y_m(x) − h(x))}²] = E_x[{(1/M) Σ_{m=1}^{M} ε_m(x)}²]    (11)
Assume that the errors have zero mean and are uncorrelated

E_x[ε_m(x)] = 0
E_x[ε_m(x) ε_l(x)] = 0, for m ≠ l
Then
We have that

E_COM = (1/M²) E_x[{Σ_{m=1}^{M} ε_m(x)}²]

      = (1/M²) E_x[Σ_{m=1}^{M} ε_m²(x) + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]

      = (1/M²) {E_x[Σ_{m=1}^{M} ε_m²(x)] + E_x[Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]}

      = (1/M²) {E_x[Σ_{m=1}^{M} ε_m²(x)] + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} E_x[ε_h(x) ε_k(x)]}

      = (1/M²) E_x[Σ_{m=1}^{M} ε_m²(x)]      (the cross terms vanish because the errors are uncorrelated)

      = (1/M) {(1/M) E_x[Σ_{m=1}^{M} ε_m²(x)]}
We finally obtain

We obtain

\[ E_{COM} = \frac{1}{M} E_{AV} \quad (12) \]

Looks great BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated.
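As a sanity check, here is a small Monte Carlo sketch (not from the slides; the error distributions are illustrative): when the M errors are independent with zero mean, the committee error comes out close to E_AV / M.

```python
import random

random.seed(0)

M = 10         # committee members
N = 100_000    # Monte Carlo samples

avg_sq = 0.0   # accumulates (1/M) * sum_m eps_m^2   -> estimates E_AV
com_sq = 0.0   # accumulates ((1/M) * sum_m eps_m)^2 -> estimates E_COM
for _ in range(N):
    # independent, zero-mean errors: the "uncorrelated" assumption
    errs = [random.gauss(0.0, 1.0) for _ in range(M)]
    avg_sq += sum(e * e for e in errs) / M
    com_sq += (sum(errs) / M) ** 2

E_AV = avg_sq / N
E_COM = com_sq / N
print(E_AV, E_COM)   # E_COM is close to E_AV / M
```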
Thus

The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.

Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so

\[ E_{COM} \leq E_{AV} \quad (13) \]

We need something better
A more sophisticated technique known as boosting.
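A companion sketch (illustrative numbers, not from the slides): when the errors share a strong common component, the committee still satisfies E_COM <= E_AV (this holds pointwise by Jensen's inequality), but the gain is far from the factor 1/M.

```python
import random

random.seed(1)

M = 5
N = 100_000
E_AV = 0.0
E_COM = 0.0
for _ in range(N):
    shared = random.gauss(0.0, 1.0)   # common error component -> correlation
    errs = [0.9 * shared + 0.1 * random.gauss(0.0, 1.0) for _ in range(M)]
    E_AV += sum(e * e for e in errs) / M
    E_COM += (sum(errs) / M) ** 2

E_AV /= N
E_COM /= N
print(E_COM, E_AV, E_AV / M)   # E_COM <= E_AV, but nowhere near E_AV / M
```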
Outline

1 Introduction

2 Bayesian Model Averaging
   Model Combination Vs. Bayesian Model Averaging

3 Committees
   Bootstrap Data Sets

4 Boosting
   AdaBoost Development
      Cost Function
      Selection Process
      Selecting New Classifiers
      Using our Deriving Trick
   AdaBoost Algorithm
      Some Remarks and Implementation Issues
      Explanation about AdaBoost's behavior
   Example
Boosting

What Boosting does?
It combines several classifiers to produce a form of a committee.

We will describe AdaBoost
"Adaptive Boosting," developed by Freund and Schapire (1995).
Sequential Training

Main difference between boosting and committee methods
The base classifiers are trained in sequence.

Explanation
Consider a two-class classification problem:

1 Samples x_1, x_2, ..., x_N
2 Binary labels t_1, t_2, ..., t_N, with t_i ∈ {-1, 1}
Cost Function

Now
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!!

Thus
For each pattern x_i, each expert classifier outputs a classification y_j(x_i) ∈ {-1, 1}.

The final decision of the committee of M experts is sign(C(x_i)), where

\[ C(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + \dots + \alpha_M y_M(x_i) \quad (14) \]
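The committee rule of Eq. (14) is easy to sketch in code. The stumps, thresholds, and weights below are hypothetical, chosen only to illustrate sign(C(x)).

```python
# Hypothetical weak classifiers: threshold rules on a 1-D input,
# each returning a label in {-1, +1}.
def make_stump(threshold, direction):
    return lambda x: direction * (1 if x > threshold else -1)

experts = [make_stump(0.0, +1), make_stump(1.0, +1), make_stump(-1.0, -1)]
alphas = [0.8, 0.4, 0.2]   # committee weights alpha_1..alpha_M

def committee_decision(x):
    # sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), as in Eq. (14)
    C = sum(a * y(x) for a, y in zip(alphas, experts))
    return 1 if C >= 0 else -1

print(committee_decision(0.5))
```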
Now

Adaptive Boosting
It works even with a continuum of classifiers.

However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers

We want the following
Review possible member classifiers.
Select them, if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now

Selection is done the following way
Testing the classifiers in the pool using a training set T of N multidimensional data points x_i:

For each point x_i we have a label t_i = 1 or t_i = -1.

Assigning a cost for actions
We test and rank all classifiers in the expert pool by:

Charging a cost exp{β} any time a classifier fails (a miss).
Charging a cost exp{-β} any time a classifier provides the right label (a hit).
Remarks about β

We require β > 0
Thus misses are penalized more heavily than hits.

Although
It may look strange to penalize hits; the only requirement is that the penalty for a hit be smaller than the penalty for a miss, that is, exp{-β} < exp{β}.

Why?
If we assign a cost a to misses and a cost b to hits, where a > b > 0, we can always rewrite such costs as a = c exp{d} and b = c exp{-d} for constants c = sqrt(ab) and d = (1/2) ln(a/b), and the common factor c does not affect the ranking of the classifiers. Thus, the exponential form does not compromise generality.
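The claim can be checked numerically; the costs a and b below are illustrative, and c, d follow the closed forms above.

```python
import math

# Given miss cost a and hit cost b with a > b > 0, recover c and d
# such that a = c*exp(d) and b = c*exp(-d).
a, b = 5.0, 0.2          # illustrative costs, not from the slides
c = math.sqrt(a * b)     # geometric mean of the two costs
d = 0.5 * math.log(a / b)

print(c * math.exp(d), c * math.exp(-d))   # reproduces a and b
```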
Exponential Loss Function

Something Notable
This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.

AdaBoost uses the exponential loss as its error criterion.
The Matrix S

When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.

Row i in the matrix is reserved for the data point x_i; column j is reserved for the j-th classifier in the pool.

            Classifiers
        1    2   ···   M
x_1     0    1   ···   1
x_2     0    0   ···   1
x_3     1    1   ···   0
...    ...  ...  ···  ...
x_N     0    0   ···   0
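A small sketch of how S could be built (the data points and pool of threshold classifiers are invented for illustration):

```python
# Toy 1-D data with labels in {-1, 1}, and a pool of simple
# threshold classifiers. S records 1 for a miss, 0 for a hit.
X = [-2.0, -0.5, 0.5, 2.0]          # data points x_1..x_N
t = [-1, -1, 1, 1]                  # labels t_i
pool = [lambda x: 1 if x > 0 else -1,
        lambda x: 1 if x > 1 else -1,
        lambda x: -1 if x > 0 else 1]

# Row i <-> data point x_i, column j <-> classifier j
S = [[0 if clf(x) == ti else 1 for clf in pool] for x, ti in zip(X, t)]
for row in S:
    print(row)
```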
Main Idea

What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.

Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.

Thus, at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights

As the drafting progresses
The more difficult samples, those on which the committee still performs badly, are assigned larger and larger weights.

Basically
The drafting process concentrates on selecting new classifiers for the committee by focusing on those which can help with the still-misclassified examples.

Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being drafted should complement each other in an optimal way.
Selecting New Classifiers

What we want
In each iteration, we need to rank all classifiers, so that we can select the current best out of the pool.

At the m-th iteration
We have already included m - 1 classifiers in the committee and we want to draft the next one.

Thus, we have the following cost function

\[ C^{(m-1)}(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + \dots + \alpha_{m-1} y_{m-1}(x_i) \quad (15) \]
Thus, we have that

Extending the cost function

\[ C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \quad (16) \]

At the first iteration m = 1
C^{(0)} is the zero function.

Thus, the total cost or total error is defined as the exponential error

\[ E = \sum_{i=1}^{N} \exp\left\{ -t_i \left( C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \right) \right\} \quad (17) \]
Thus

We want to determine
α_m and y_m in an optimal way.

Thus, rewriting

\[ E = \sum_{i=1}^{N} w_i^{(m)} \exp\{ -t_i \alpha_m y_m(x_i) \} \quad (18) \]

Where, for i = 1, 2, ..., N

\[ w_i^{(m)} = \exp\left\{ -t_i C^{(m-1)}(x_i) \right\} \quad (19) \]
Thus

In the first iteration, w_i^{(1)} = 1 for i = 1, ..., N
During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.

Splitting equation (Eq. 18)

\[ E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\} \quad (20) \]

Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
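Eq. (20) can be sketched directly; the labels, predictions, and weights below are illustrative:

```python
import math

# Split the exponential error of Eq. (20) into hit and miss parts
# for one candidate classifier on toy values.
t   = [ 1, -1,  1,  1, -1]          # labels t_i
y_m = [ 1, -1, -1,  1,  1]          # candidate's predictions (2 misses)
w   = [1.0, 1.0, 1.0, 1.0, 1.0]     # current weights w_i^(m)
alpha = 0.5

Wc = sum(wi for wi, ti, yi in zip(w, t, y_m) if ti == yi)   # weight of hits
We = sum(wi for wi, ti, yi in zip(w, t, y_m) if ti != yi)   # weight of misses
E = Wc * math.exp(-alpha) + We * math.exp(alpha)

print(Wc, We, E)
```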
Now

Writing the first summand as W_c exp{-α_m} and the second as W_e exp{α_m}

\[ E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\} \quad (21) \]

Now, for the selection of y_m the exact value of α_m > 0 is irrelevant
Since, for a fixed α_m, minimizing E is equivalent to minimizing exp{α_m} E, and

\[ \exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\} \quad (22) \]
Then

Since exp{2α_m} > 1, we can rewrite (Eq. 22)

\[ \exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\} \quad (23) \]

Thus

\[ \exp\{\alpha_m\} E = (W_c + W_e) + W_e \left( \exp\{2\alpha_m\} - 1 \right) \quad (24) \]

Now
W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus

The right-hand side of the equation is minimized
When at the m-th iteration we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error rate).

Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
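A sketch of this selection step, with an invented pool, data, and weights: the drafted classifier is simply the one minimizing the weighted miss cost W_e.

```python
# Toy data, current weights, and a pool of threshold classifiers.
X = [-2.0, -0.5, 0.5, 2.0]
t = [-1, 1, -1, 1]
w = [0.1, 0.4, 0.3, 0.2]            # current weights w_i^(m)
pool = [lambda x: 1 if x > 0 else -1,
        lambda x: 1 if x > 1 else -1,
        lambda x: 1 if x > -1 else -1]

def weighted_miss(clf):
    # W_e for this candidate: total weight of misclassified points
    return sum(wi for x, ti, wi in zip(X, t, w) if clf(x) != ti)

We_per_clf = [weighted_miss(c) for c in pool]
best = min(range(len(pool)), key=lambda j: We_per_clf[j])
print(We_per_clf, best)
```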
Deriving against the weight α_m

We have

\[ \frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\} \quad (25) \]

Making the equation equal to zero and multiplying by exp{α_m}

\[ -W_c + W_e \exp\{2\alpha_m\} = 0 \quad (26) \]

The optimal value is thus

\[ \alpha_m = \frac{1}{2} \ln\left( \frac{W_c}{W_e} \right) \quad (27) \]
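The closed form of Eq. (27) can be checked against a brute-force grid search (the values of W_c and W_e are illustrative):

```python
import math

# Verify numerically that alpha = (1/2) ln(Wc / We) minimizes
# E(alpha) = Wc * exp(-alpha) + We * exp(alpha).
Wc, We = 0.7, 0.3

def E(alpha):
    return Wc * math.exp(-alpha) + We * math.exp(alpha)

alpha_opt = 0.5 * math.log(Wc / We)

# Compare against a fine grid around the optimum.
grid = [i / 1000.0 for i in range(-2000, 2001)]
alpha_grid = min(grid, key=E)
print(alpha_opt, alpha_grid)   # should agree to about 1e-3
```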
Now
Making the total sum of all weights
W = Wc + We (28)
We can rewrite the previous equation as
αm = 12 ln
(W −WeWe
)= 1
2 ln(1− em
em
)(29)
With the percentage rate of error given the weights of the data points
em = WeW (30)
47 / 65
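A minimal numeric sketch of Eqs. (27)–(30), using hypothetical weight values W_c and W_e, shows that the two forms of α_m agree:

```python
import math

# Illustrative weighted sums (hypothetical values): W_c is the total weight
# of correctly classified points, W_e of misclassified points.
Wc, We = 0.8, 0.2
W = Wc + We                # Eq. (28): total sum of all weights
em = We / W                # Eq. (30): weighted error rate

alpha_from_weights = 0.5 * math.log(Wc / We)        # Eq. (27)
alpha_from_error = 0.5 * math.log((1 - em) / em)    # Eq. (29)

print(alpha_from_weights, alpha_from_error)  # the two forms agree
```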
![Page 123: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/123.jpg)
What about the weights?

Using the equation

w_i^(m) = exp{−t_i C^(m−1)(x_i)}    (31)

and because we have α_m and y_m(x_i):

w_i^(m+1) = exp{−t_i C^(m)(x_i)}
          = exp{−t_i [C^(m−1)(x_i) + α_m y_m(x_i)]}
          = w_i^(m) exp{−t_i α_m y_m(x_i)}
48 / 65
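The multiplicative update above can be sketched numerically; the α_m, target, prediction, and weight values below are hypothetical:

```python
import math

# Multiplicative weight update derived from Eq. (31), toy values.
alpha_m = 0.55            # classifier weight alpha_m (hypothetical)
t_i, y_m_xi = 1, -1       # target and prediction in {-1, +1}: a miss
w_i = 0.25                # current weight w_i^(m)

w_next = w_i * math.exp(-t_i * alpha_m * y_m_xi)
# A miss (t_i * y_m(x_i) = -1) multiplies the weight by exp(alpha_m) > 1,
# so misclassified points gain influence in the next round.
print(w_next)
```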
![Page 127: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/127.jpg)
Sequential Training

Thus
AdaBoost trains each new classifier on a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifier, so as to give greater weight to the misclassified data points.
49 / 65
![Page 128: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/128.jpg)
Illustration
Schematic Illustration
50 / 65
![Page 129: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/129.jpg)
Outline

1 Introduction
2 Bayesian Model Averaging
      Model Combination Vs. Bayesian Model Averaging
3 Committees
      Bootstrap Data Sets
4 Boosting
      AdaBoost Development
            Cost Function
            Selection Process
            Selecting New Classifiers
            Using our Deriving Trick
      AdaBoost Algorithm
            Some Remarks and Implementation Issues
            Explanation about AdaBoost's behavior
      Example
51 / 65
![Page 130: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/130.jpg)
AdaBoost Algorithm

Step 1
Initialize the weights {w_i^(1)} to 1/N.

Step 2
For m = 1, 2, ..., M:

1. Select the weak classifier y_m(x) that minimizes the weighted error function:

   arg min_{y_m} Σ_{i=1}^N w_i^(m) I(y_m(x_i) ≠ t_i) = arg min_{y_m} Σ_{t_i ≠ y_m(x_i)} w_i^(m) = arg min_{y_m} W_e    (32)

   where I is the indicator function.

2. Evaluate

   e_m = [Σ_{n=1}^N w_n^(m) I(y_m(x_n) ≠ t_n)] / [Σ_{n=1}^N w_n^(m)]    (33)

52 / 65
![Page 136: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/136.jpg)
AdaBoost Algorithm

Step 3
Set the α_m weight to

α_m = (1/2) ln{(1 − e_m)/e_m}    (34)

Now update the weights of the data for the next iteration.
If t_i ≠ y_m(x_i), i.e. a miss:

w_i^(m+1) = w_i^(m) exp{α_m} = w_i^(m) √((1 − e_m)/e_m)    (35)

If t_i = y_m(x_i), i.e. a hit:

w_i^(m+1) = w_i^(m) exp{−α_m} = w_i^(m) √(e_m/(1 − e_m))    (36)
53 / 65
![Page 139: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/139.jpg)
Finally, make predictions

For this use

Y_M(x) = sign(Σ_{m=1}^M α_m y_m(x))    (37)
54 / 65
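Steps 1–3 and the final prediction of Eq. (37) can be sketched end to end. The weak-classifier pool (1-D threshold stumps) and the toy training set below are assumptions for illustration, not part of the slides:

```python
import math

def stump_pool(X):
    # Assumed pool of weak classifiers: 1-D threshold stumps.
    # Each stump (th, s) predicts s if x > th, else -s, with s in {-1, +1}.
    return [(th, s) for th in sorted(set(X)) for s in (+1, -1)]

def stump_predict(stump, x):
    th, s = stump
    return s if x > th else -s

def adaboost(X, T, M=10):
    N = len(X)
    w = [1.0 / N] * N                          # Step 1: uniform weights
    ensemble = []                              # pairs (alpha_m, stump)
    for _ in range(M):
        # Step 2.1 (Eq. 32): pick the stump with the lowest weighted error W_e
        best, We = None, float("inf")
        for stump in stump_pool(X):
            err = sum(wi for wi, x, t in zip(w, X, T)
                      if stump_predict(stump, x) != t)
            if err < We:
                best, We = stump, err
        em = We / sum(w)                       # Step 2.2 (Eq. 33)
        if em >= 0.5:                          # no better than chance: stop
            break
        if em == 0:                            # perfect weak classifier
            ensemble.append((1.0, best))
            break
        alpha = 0.5 * math.log((1 - em) / em)  # Step 3 (Eq. 34)
        ensemble.append((alpha, best))
        # Weight update (Eqs. 35-36): misses grow, hits shrink
        w = [wi * math.exp(-alpha * t * stump_predict(best, x))
             for wi, x, t in zip(w, X, T)]
    return ensemble

def predict(ensemble, x):
    # Eq. (37): sign of the weighted committee vote
    vote = sum(a * stump_predict(stump, x) for a, stump in ensemble)
    return 1 if vote >= 0 else -1

# Toy 1-D training set (hypothetical) that no single stump classifies:
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
T = [+1, +1, -1, -1, +1, +1]
ens = adaboost(X, T, M=3)
print([predict(ens, x) for x in X])  # matches T after three rounds
```

Three rounds suffice here because each selected stump splits the weighted data better than chance, so the committee's sign vote corrects the stumps' individual mistakes.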
![Page 140: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/140.jpg)
Observations

First
Training the first base classifier follows the usual procedure of training a single classifier.

Second
From (Eq. 35) and (Eq. 36), we can see that the weighting coefficients are increased for data points that are misclassified.

Third
The quantity e_m represents a weighted measure of the error rate.
Thus α_m gives more weight to the more accurate classifiers.
55 / 65
![Page 143: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/143.jpg)
In addition

The pool of classifiers in Step 1 can be substituted by a family of classifiers
whose members are trained to minimize the error function given the current weights.

Now
If a finite set of classifiers is given, we only need to test each classifier once on each data point.

The Scouting Matrix S
It can be reused at each iteration: multiplying the transposed weight vector w^(m) by S yields W_e for each machine.
56 / 65
![Page 146: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/146.jpg)
We have then

The following:

[W_e^(1) W_e^(2) · · · W_e^(M)] = (w^(m))^T S    (38)

This allows us to reformulate the weight update step such that only misses lead to weight modification.

Note
The weight vector w^(m) is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
57 / 65
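A small sketch of Eq. (38), under the assumption that the scouting matrix S is the N × M 0/1 matrix with S[i][j] = 1 when classifier j misclassifies data point i (computed once, since a finite pool never changes). The weight and matrix values below are hypothetical:

```python
# Current weights w^(m) over N = 4 data points (hypothetical values)
w = [0.1, 0.4, 0.2, 0.3]

# Assumed scouting matrix for M = 2 machines: row i marks which
# machines misclassify point i.
S = [[1, 0],   # point 0: missed by machine 0 only
     [0, 1],   # point 1: missed by machine 1 only
     [0, 0],   # point 2: missed by neither
     [1, 1]]   # point 3: missed by both

# (w^(m))^T S: the weighted error W_e of each machine in one pass
We = [sum(w[i] * S[i][j] for i in range(len(w))) for j in range(2)]
print(We)  # W_e for machine 0 and machine 1
```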
![Page 150: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/150.jpg)
Explanation

Graph of (1 − e_m)/e_m over the range 0 ≤ e_m ≤ 1
58 / 65
![Page 151: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/151.jpg)
So we have

Graph of α_m
59 / 65
![Page 152: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/152.jpg)
We have the following cases

We have the following
If e_m → 1, then all the samples were incorrectly classified!!!

Thus
We get lim_{e_m→1} (1 − e_m)/e_m → 0, so α_m → −∞.

Finally
For every misclassified sample, w_i^(m+1) = w_i^(m) exp{α_m} → 0.
We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
60 / 65
![Page 155: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/155.jpg)
Thus, we have

If e_m → 1/2
We have α_m → 0; thus, whether the sample is well or badly classified:

exp{−α_m t_i y_m(x_i)} → 1    (39)

So the weights do not change at all.

What about e_m → 0?
We have that α_m → +∞.

Thus, we have
For samples that are always correctly classified, w_i^(m+1) = w_i^(m) exp{−α_m t_i y_m(x_i)} → 0.
Thus, the only committee member we need is m; we do not need m + 1.
61 / 65
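The three regimes above can be checked numerically from Eq. (34); the sample error rates below are illustrative:

```python
import math

def alpha(em):
    # Eq. (34): classifier weight as a function of the weighted error rate
    return 0.5 * math.log((1 - em) / em)

print(alpha(0.5))    # e_m = 1/2: alpha = 0, weights unchanged
print(alpha(0.01))   # e_m -> 0: alpha large and positive
print(alpha(0.99))   # e_m -> 1: alpha large and negative (reverse the answers)
```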
![Page 158: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/158.jpg)
Outline

1 Introduction
2 Bayesian Model Averaging
      Model Combination Vs. Bayesian Model Averaging
3 Committees
      Bootstrap Data Sets
4 Boosting
      AdaBoost Development
            Cost Function
            Selection Process
            Selecting New Classifiers
            Using our Deriving Trick
      AdaBoost Algorithm
            Some Remarks and Implementation Issues
            Explanation about AdaBoost's behavior
      Example
62 / 65
![Page 159: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/159.jpg)
Example

Training set

Classes
ω1 = N(0, 1)
ω2 = N(0, σ²) − N(0, 1)
63 / 65
![Page 160: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/160.jpg)
Example

Weak Classifier
Perceptron

For m = 1, ..., T
64 / 65
![Page 162: Machine Learning for Data Mining - Combining models](https://reader034.fdocuments.us/reader034/viewer/2022042819/55c8291abb61eb8d4d8b4611/html5/thumbnails/162.jpg)
As the process progresses

For m = 40
65 / 65