Transcript of Artificial Intelligence: Machine Learning, Ensemble Methods (ColumbiaX CSMM.101x, 1T2017)
Artificial Intelligence
Machine Learning
Ensemble Methods
Outline
1. Majority voting
2. Boosting
3. Bagging
4. Random Forests
5. Conclusion
Majority Voting
• A randomly chosen hyperplane has an expected error of 0.5.
• Many random hyperplanes combined by majority vote will still be random.
• Suppose we have m classifiers performing slightly better than random, that is, error = 0.5 − ε.
• What if we combined these m slightly-better-than-random classifiers? Would majority vote be a good choice?
Condorcet’s Jury Theorem
Marquis de Condorcet, Application of Analysis to the Probability of Majority Decisions, 1785.
Assumptions:
1. Each individual makes the right choice with probability p.
2. The votes are independent.
If p > 0.5, then adding more voters increases the probability that the majority decision is correct. If p < 0.5, then adding more voters makes things worse.
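The theorem is easy to check numerically: the probability that a majority of n independent voters is correct is a binomial tail sum. A minimal sketch (the function name and parameter values are illustrative, not from the lecture):

```python
import math

def majority_correct_prob(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the right decision (n odd)."""
    k_min = n // 2 + 1  # smallest strict majority
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# With p > 0.5 the majority improves as voters are added;
# with p < 0.5 it degrades, exactly as the theorem states.
print(majority_correct_prob(0.6, 1))    # 0.6
print(majority_correct_prob(0.6, 101))  # much closer to 1
print(majority_correct_prob(0.4, 101))  # much closer to 0
```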
Ensemble Methods
• An Ensemble Method combines the predictions of many individual classifiers by majority voting.
• Such individual classifiers, called weak learners, are required to perform slightly better than random.
How do we produce independent weak learners using the same training data?
• Use a strategy to obtain relatively independent weak learners!
• Different methods:
1. Boosting
2. Bagging
3. Random Forests
Boosting
• The first ensemble method.
• One of the most powerful Machine Learning methods.
• Popular algorithm: AdaBoost.M1 (Freund and Schapire, 1997).
• “Best off-the-shelf classifier in the world”, Breiman (CART’s inventor), 1998.
• A simple algorithm.
• Weak learners can be trees, perceptrons, decision stumps, etc.
• Idea: train the weak learners on weighted training examples.
Boosting
• The predictions from all of the Gm, m ∈ {1, · · · , M}, are combined by weighted majority voting.
• αm is the contribution of each weak learner Gm.
• It is computed by the boosting algorithm to give a weighted importance to the classifiers in the sequence.
• The decision of a highly-performing classifier in the sequence should weigh more than the decisions of less important classifiers.
• This is captured in:
G(x) = sign[ ∑_{m=1}^{M} α_m G_m(x) ]
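As a toy numerical illustration of this weighted vote (the error values below are made up for the example): accurate learners get large α and dominate the sign of the sum.

```python
import math

# Hypothetical errors for three weak learners; α_m = log((1 - err_m) / err_m)
errs = [0.2, 0.3, 0.45]
alphas = [math.log((1 - e) / e) for e in errs]

votes = [+1, +1, -1]  # predictions G_m(x) on a single example
score = sum(a * g for a, g in zip(alphas, votes))
G_x = 1 if score > 0 else -1  # G(x) = sign of the weighted sum
print(G_x)  # 1: the two accurate learners outvote the near-random dissenter
```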
Boosting
The error rate on the training sample:
err := (1/n) ∑_{i=1}^{n} 1{y_i ≠ G(x_i)}
The error rate of each weak learner:
err_m := ( ∑_{i=1}^{n} w_i 1{y_i ≠ G_m(x_i)} ) / ( ∑_{i=1}^{n} w_i )
Intuition:
- Give large weights to hard examples.
- Give small weights to easy examples.
Boosting
For each weak learner m, we associate an error err_m:
α_m = log( (1 − err_m) / err_m )
AdaBoost
1. Initialize the example weights w_i = 1/n, i = 1, · · · , n.
2. For m = 1 to M (number of weak learners):
(a) Fit a classifier G_m(x) to the training data using the weights w_i.
(b) Compute
err_m := ( ∑_{i=1}^{n} w_i 1{y_i ≠ G_m(x_i)} ) / ( ∑_{i=1}^{n} w_i )
(c) Compute
α_m = log( (1 − err_m) / err_m )
(d) Update w_i ← w_i · exp[ α_m · 1(y_i ≠ G_m(x_i)) ], for i = 1, · · · , n.
3. Output
G(x) = sign[ ∑_{m=1}^{M} α_m G_m(x) ]
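The steps above can be sketched in Python with decision stumps as the weak learners. This is a minimal illustration of the algorithm, not an optimized implementation; the helper names (`fit_stump`, `adaboost`) are my own.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: best (error, feature j, threshold t, polarity s)."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)
    for j in range(d):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(s * (X[:, j] - t) > 0, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, M=20):
    """AdaBoost.M1 with stumps, following steps 1-3 above (labels y in {-1, +1})."""
    n = len(y)
    w = np.full(n, 1 / n)                      # step 1: uniform weights
    learners = []
    for _ in range(M):
        err, j, t, s = fit_stump(X, y, w)      # steps 2(a)-(b)
        err = max(err, 1e-10)                  # guard against log(0)
        alpha = np.log((1 - err) / err)        # step 2(c)
        pred = np.where(s * (X[:, j] - t) > 0, 1, -1)
        w = w * np.exp(alpha * (pred != y))    # step 2(d): up-weight mistakes
        learners.append((alpha, j, t, s))
    def G(Xq):                                 # step 3: weighted majority vote
        score = sum(a * np.where(s * (Xq[:, j] - t) > 0, 1, -1)
                    for a, j, t, s in learners)
        return np.sign(score)
    return G
```

On a trivially separable toy set, e.g. `adaboost(np.array([[0.],[1.],[2.],[3.]]), np.array([-1,-1,1,1]))`, the returned `G` reproduces the labels.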
Digression: Decision Stumps
This is an example of a very weak classifier: a simple two-terminal-node decision tree for binary classification.
f(x_i | j, t) = +1 if x_ij > t, −1 otherwise,
where j ∈ {1, · · · , d}.
Example: a dataset with 10 features, 2,000 training examples and 10,000 test examples.
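A stump of this form is essentially one line of code (a sketch; `stump_predict` is a name of my choosing):

```python
def stump_predict(x, j, t):
    """f(x | j, t): +1 if feature j of x exceeds threshold t, else -1."""
    return 1 if x[j] > t else -1

# One example with d = 2 features:
print(stump_predict([0.2, 3.1], j=1, t=2.5))  # 1, since 3.1 > 2.5
print(stump_predict([0.2, 3.1], j=0, t=2.5))  # -1, since 0.2 <= 2.5
```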
AdaBoost Performance
[Figure 10.2 (Boosting and Additive Trees): test error rate for boosting with stumps, as a function of the number of boosting iterations; also shown are the test error rates for a single stump and for a 244-node classification tree.]
Error rates:
- Random guessing: 50%.
- Single stump: 45.8%.
- Large (244-node) classification tree: 24.7%.
- AdaBoost with stumps: 5.8% after 400 iterations!
AdaBoost with decision stumps leads to a form of feature selection.
AdaBoost with Decision Stumps
[Figure 10.6: predictor variable importance spectrum for the spam data; the variable names are written on the vertical axis.]
Bagging & Bootstrapping
• The bootstrap is a re-sampling technique ≡ sampling from the empirical distribution.
• It aims to improve the quality of estimators.
• Bagging (and resampling variants of Boosting) are based on bootstrapping.
• Both use re-sampling to generate weak learners for classification.
• Strategy: randomly distort the data by re-sampling.
• Train weak learners on the re-sampled training sets.
• Bootstrap aggregation ≡ Bagging.
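Drawing a bootstrap sample is simply sampling with replacement from the training set. A minimal sketch (the function name is mine):

```python
import random

def bootstrap_sample(data, rng=None):
    """n draws with replacement from data, i.e. sampling from the
    empirical distribution; the result has the same size as the original."""
    rng = rng or random.Random()
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

sample = bootstrap_sample([10, 20, 30, 40, 50])
print(sample)  # 5 values; some originals may repeat, others may be missing
```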
Bagging
Training: for b = 1, · · · , B:
1. Draw a bootstrap sample B_b of size ℓ from the training data.
2. Train a classifier f_b on B_b.
Classification: classify by majority vote among the B classifiers:
f_avg := (1/B) ∑_{b=1}^{B} f_b(x)
Bagging
Bagging works well for trees:
[Figure 8.9: bagging trees on a simulated dataset. The top left panel shows the original tree; eleven trees grown on bootstrap samples are also shown, with the top split annotated for each tree.]
Random Forests
1. Random forests modify bagging with trees to reduce the correlation between the trees.
2. Ordinary tree training optimizes each split over all dimensions.
3. Random forests instead choose a different subset of dimensions at each split; the number of dimensions chosen is m.
4. The optimal split is chosen within that subset.
5. The subset is chosen at random out of all dimensions 1, . . . , d.
6. Recommended: m = √d or smaller.
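The per-split feature subsetting in step 3 can be illustrated directly (a sketch; the function name is mine):

```python
import math
import random

def candidate_features(d, rng=None):
    """At each split, a random forest considers only m = floor(sqrt(d))
    of the d features, rather than all of them as in plain bagged trees."""
    rng = rng or random.Random()
    m = max(1, int(math.sqrt(d)))
    return rng.sample(range(d), m)  # m distinct feature indices

print(candidate_features(16))  # 4 distinct indices drawn from 0..15
```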
Credit
• “An Introduction to Statistical Learning, with Applications in R” (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.