Transcript of Artificial Intelligence Machine Learning Ensemble Methods (ColumbiaX CSMM.101x, 1T2017)

Artificial Intelligence

Machine Learning

Ensemble Methods


Outline

1. Majority voting

2. Boosting

3. Bagging

4. Random Forests

5. Conclusion


Majority Voting

• A randomly chosen hyperplane has an expected error of 0.5.

• Many random hyperplanes combined by majority vote will still be random.

• Suppose we have m classifiers, performing slightly better than random, that is, error = 0.5 - ε.

• Combine: make a decision based on majority vote?

• What if we combined these m slightly-better-than-random classifiers? Would majority vote be a good choice?


Condorcet’s Jury Theorem

Marquis de Condorcet, Application of Analysis to the Probability of Majority Decisions, 1785.

Assumptions:

1. Each individual makes the right choice with a probability p.

2. The votes are independent.

If p > 0.5, then adding more voters increases the probability that the majority decision is correct. If p < 0.5, then adding more voters makes things worse.
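The theorem is easy to check numerically. Below is a minimal sketch (assuming an odd number of voters so ties cannot occur); the helper name p_majority_correct is illustrative, not from the slides.

```python
# Numerical check of Condorcet's jury theorem: probability that the majority of
# n independent voters is correct when each voter is right with probability p.
# Assumes n is odd so there is never a tie.
from math import comb

def p_majority_correct(n, p):
    # Majority correct <=> at least n//2 + 1 voters are correct (a binomial tail).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101):
    print(n, round(p_majority_correct(n, 0.55), 3), round(p_majority_correct(n, 0.45), 3))
# With p = 0.55 the majority improves as n grows; with p = 0.45 it gets worse.
```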


Ensemble Methods

• An Ensemble Method combines the predictions of many individual classifiers by majority voting.

• Such individual classifiers, called weak learners, are required to perform slightly better than random.

How do we produce independent weak learners using the same training data?

• Use a strategy to obtain relatively independent weak learners!

• Different methods:

1. Boosting

2. Bagging

3. Random Forests


Boosting

• First ensemble method.

• One of the most powerful Machine Learning methods.

• Popular algorithm: AdaBoost.M1 (Freund and Schapire, 1997).

• “Best off-the-shelf classifier in the world”, Breiman (CART’s inventor), 1998.

• Simple algorithm.

• Weak learners can be trees, perceptrons, decision stumps, etc.

• Idea: train the weak learners on weighted training examples.


Boosting

• The predictions from all of the Gm, m ∈ {1, · · · , M}, are combined by weighted majority voting.

• αm is the contribution of each weak learner Gm.

• It is computed by the boosting algorithm to give a weighted importance to the classifiers in the sequence.

• The decision of a highly performing classifier in the sequence should weigh more than that of less accurate classifiers in the sequence.

• This is captured in:

\[ G(x) = \operatorname{sign}\Big[\sum_{m=1}^{M} \alpha_m G_m(x)\Big] \]


Boosting

The error rate on the training sample:

\[ \mathrm{err} := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{y_i \neq G(x_i)\} \]

The weighted error rate of each weak learner:

\[ \mathrm{err}_m := \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\{y_i \neq G_m(x_i)\}}{\sum_{i=1}^{n} w_i} \]

Intuition:

- Give large weights to hard examples.

- Give small weights to easy examples.


Boosting

For each weak learner m, we associate an error errm and a weight

\[ \alpha_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right) \]
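A quick numerical check of this weight (a minimal sketch; the errm values below are illustrative only): errm = 0.5 gives αm = 0, and αm grows as errm shrinks, so more accurate weak learners get more say in the vote.

```python
# alpha_m = log((1 - err_m) / err_m): zero for a random learner, larger for
# more accurate learners.
from math import log

for err_m in (0.5, 0.4, 0.25, 0.1):
    print(err_m, round(log((1 - err_m) / err_m), 3))
# 0.5 -> 0.0, 0.4 -> 0.405, 0.25 -> 1.099, 0.1 -> 2.197
```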


AdaBoost

1. Initialize the example weights wi = 1/n, i = 1, · · · , n.

2. For m = 1 to M (number of weak learners):

(a) Fit a classifier Gm(x) to the training data using the weights wi.

(b) Compute

\[ \mathrm{err}_m := \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\{y_i \neq G_m(x_i)\}}{\sum_{i=1}^{n} w_i} \]

(c) Compute

\[ \alpha_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right) \]

(d) Update wi ← wi · exp[αm · 1(yi ≠ Gm(xi))] for i = 1, · · · , n.

3. Output

\[ G(x) = \operatorname{sign}\Big[\sum_{m=1}^{M} \alpha_m G_m(x)\Big] \]
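A minimal Python sketch of the steps above, using scikit-learn depth-1 trees (decision stumps) as the weak learners Gm. The helper names fit_adaboost and predict_adaboost, and the choice of scikit-learn stumps, are assumptions for illustration, not part of the slides; the sketch also assumes 0 < errm < 1 (production code would guard against a perfect weak learner).

```python
# AdaBoost.M1 sketch following steps 1-3 above; labels y are assumed to be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                           # 1. initialize w_i = 1/n
    learners, alphas = [], []
    for m in range(M):                                # 2. for m = 1..M
        G_m = DecisionTreeClassifier(max_depth=1)     # a decision stump
        G_m.fit(X, y, sample_weight=w)                # (a) fit using the weights w_i
        miss = (G_m.predict(X) != y)                  # indicator 1{y_i != G_m(x_i)}
        err_m = np.dot(w, miss) / w.sum()             # (b) weighted error
        alpha_m = np.log((1 - err_m) / err_m)         # (c) learner weight
        w = w * np.exp(alpha_m * miss)                # (d) up-weight misclassified examples
        learners.append(G_m)
        alphas.append(alpha_m)
    return learners, alphas

def predict_adaboost(learners, alphas, X):
    # 3. G(x) = sign(sum_m alpha_m G_m(x))
    return np.sign(sum(a * g.predict(X) for g, a in zip(learners, alphas)))
```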


Digression: Decision Stumps

This is an example of a very weak classifier: a simple two-terminal-node decision tree for binary classification.

\[ f(x_i \mid j, t) = \begin{cases} +1 & \text{if } x_{ij} > t \\ -1 & \text{otherwise} \end{cases} \]

where j ∈ {1, · · · , d} and t is a threshold.

Example: a dataset with 10 features, 2,000 training examples, and 10,000 test examples.
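A hand-rolled stump matching f(xi | j, t) above: it searches every feature j and candidate threshold t for the split with the lowest (optionally weighted) training error. This is a sketch under those assumptions; the helper names are illustrative.

```python
# Exhaustive decision-stump fit: f(x | j, t) = +1 if x_j > t, else -1.
import numpy as np

def fit_stump(X, y, w=None):
    """X: (n, d) array; y: labels in {-1, +1}; w: optional non-negative example weights."""
    n, d = X.shape
    w = np.full(n, 1.0 / n) if w is None else w / w.sum()
    best_j, best_t, best_err = 0, X[0, 0], np.inf
    for j in range(d):
        for t in np.unique(X[:, j]):                  # candidate thresholds
            pred = np.where(X[:, j] > t, 1, -1)       # f(x | j, t)
            err = np.dot(w, pred != y)                # weighted 0/1 error
            if err < best_err:
                best_j, best_t, best_err = j, t, err
    return best_j, best_t

def predict_stump(j, t, X):
    return np.where(X[:, j] > t, 1, -1)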

AdaBoost Performance

[Figure 10.2: test error rate for boosting with stumps, as a function of the number of boosting iterations; also shown are the test error rates for a single stump and a 244-node classification tree.]

Error rates:

- Random: 50%.

- Stump: 45.8%.

- Large classification tree: 24.7%.

- AdaBoost with stumps: 5.8% after 400 iterations!


AdaBoost with decision stumps leads to a form of feature selection: each stump splits on a single feature, so the boosted ensemble effectively selects and ranks the most useful features.

AdaBoost-Decision Stumps

[Figure 10.6: predictor variable importance spectrum for the spam data; the variable names are written on the vertical axis.]

Bagging & Bootstrapping

• Bootstrap is a re-sampling technique ≡ sampling from the empirical distribution.

• Aims to improve the quality of estimators.

• Bagging and Boosting are based on bootstrapping.

• Both use re-sampling to generate weak learners for classification.

• Strategy: Randomly distort data by re-sampling.

• Train weak learners on re-sampled training sets.

• Bootstrap aggregation ≡ Bagging.
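A minimal bootstrap sketch: draw samples with replacement from the data (i.e., sample from the empirical distribution) to estimate the variability of an estimator. The median-of-a-sample example and the parameter values are purely illustrative.

```python
# Bootstrap estimate of the standard error of the sample median.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)       # an observed sample (illustrative)

B = 1000
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=len(data), replace=True)   # sample from the empirical distribution
    boot_medians[b] = np.median(resample)

print("sample median:", np.median(data))
print("bootstrap std. error:", boot_medians.std())
```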


Bagging

Training

For b = 1, · · · , B:

1. Draw a bootstrap sample Bb of size ℓ from the training data.

2. Train a classifier fb on Bb.

Classification: classify by majority vote among the B trees:

\[ f_{\mathrm{avg}} := \frac{1}{B} \sum_{b=1}^{B} f_b(x) \]
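A minimal sketch of this recipe, with bootstrap samples of size n, scikit-learn trees as the classifiers fb, and a majority vote at prediction time. The helper names and the use of scikit-learn are assumptions for illustration.

```python
# Bagging sketch: B bootstrap samples, one tree per sample, majority vote to classify.
# Labels y are assumed to be in {-1, +1}; use an odd B to avoid ties.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(X, y, B=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample B_b (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagging(trees, X):
    votes = sum(t.predict(X) for t in trees)          # sum of -1/+1 votes
    return np.sign(votes)                             # majority vote among the B trees
```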

Bagging

Bagging works well for trees:

[Figure 8.9: bagging trees on a simulated dataset. The top left panel shows the original tree; eleven trees grown on bootstrap samples are shown, and for each tree the top split is annotated.]

Random Forests

1. Random forests modify bagging with trees to reduce the correlation between trees.

2. Standard tree training optimizes each split over all dimensions.

3. In a random forest, a different subset of dimensions is chosen at each split; the number of dimensions chosen is m.

4. The optimal split is chosen within that subset.

5. The subset is chosen at random out of all dimensions 1, . . . , d.

6. Recommended: m = √d or smaller.
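In practice this per-split feature subsampling is a library option; a minimal scikit-learn sketch is below, where max_features="sqrt" corresponds to m = √d. The dataset and parameter values are illustrative only.

```python
# Random forest with m = sqrt(d) candidate features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of bagged trees
    max_features="sqrt",     # m = sqrt(d) dimensions considered at each split
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", forest.score(X_te, y_te))
```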

Credit

• “An Introduction to Statistical Learning, with Applications in R” (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.