
DATA 201 - Techniques in Data Science

Lecture 15: Ensemble Learning

Binh Nguyen

School of Mathematics and Statistics, Victoria University of Wellington

Adapted from “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron


Table of contents

1. Voting Classifiers

2. Bagging and Pasting

3. Random Patches and Random Subspaces

4. Random Forests

5. Boosting


Voting Classifiers


Training diverse classifiers

• Suppose we have trained a few classifiers, each one achieving a similar accuracy.

• There is a simple way to create a better classifier...


Hard voting classifier predictions

• Hard voting: aggregate the predictions of each classifier and predict the class that gets the most votes. This voting classifier often achieves a higher accuracy than the best classifier in the ensemble.

• If each classifier is a weak learner, the ensemble can still be a strong learner, provided there are a sufficient number of weak learners and they are sufficiently diverse (a quick numerical check of this claim follows below).
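To see why an ensemble of weak learners can be a strong learner, here is a quick numerical check (a sketch added for illustration, not from the slides): assuming 1,000 classifiers that are each correct only 51% of the time and whose errors are independent, the probability that a majority vote is correct can be computed from the binomial distribution.

from scipy.stats import binom

# Rough check of the weak-learners-to-strong-learner claim.
# Assumes the classifiers' errors are independent, which real ensembles only approximate.
n_classifiers = 1000
p_correct = 0.51  # each weak learner is barely better than chance

# The majority vote is correct when more than half of the classifiers are correct.
p_majority_correct = 1 - binom.cdf(n_classifiers // 2, n_classifiers, p_correct)
print(p_majority_correct)  # roughly 0.73 under the independence assumption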


Example

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')


voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1, l1_ratio=None,
                                                 max_iter=100, multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42, solver='lbfgs',
                                                 tol=0.0001, verbose=0,
                                                 warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                                     class_weight=None,
                                                     crit...oob_score=False,
                                                     random_state=42, verbose=0,
                                                     warm_start=False)),
                             ('svc',
                              SVC(C=1.0, break_ties=False, cache_size=200,
                                  class_weight=None, coef0=0.0,
                                  decision_function_shape='ovr', degree=3,
                                  gamma='scale', kernel='rbf', max_iter=-1,
                                  probability=False, random_state=42,
                                  shrinking=True, tol=0.001, verbose=False))],
                 flatten_transform=True, n_jobs=None, voting='hard',
                 weights=None)


from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')

voting_clf.fit(X_train, y_train);

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


Soft voting

• If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then we can predict the class with the highest class probability, averaged over all the individual classifiers ⇒ soft voting.

• Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes (see the sketch below).
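As an illustration of what soft voting computes (a minimal sketch, not from the slides), the ensemble averages each classifier's predict_proba() output and picks the class with the highest mean probability; this assumes log_clf, rnd_clf and svm_clf from the previous example have been fitted, with probability=True for the SVC.

import numpy as np

# Average the class probabilities over the individual classifiers...
probas = np.mean([clf.predict_proba(X_test) for clf in (log_clf, rnd_clf, svm_clf)], axis=0)
# ...and predict the class with the highest averaged probability.
y_pred_soft = np.argmax(probas, axis=1)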


Switching voting method

Now let's try using a hard voting classifier again. We do not actually need to retrain the classifier; we can just set voting to "hard":

voting_clf.voting = 'hard'

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


Comments

• Ensemble methods work best when the predictors are as independent from one another as possible.

• One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.


Bagging and Pasting


Introduction

• One way to get a diverse set of classifiers is to use very different training algorithms.

• Another method is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set (a small sampling sketch follows this list).

- Bagging: when sampling is performed with replacement.

- Pasting: when sampling is performed without replacement.
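A minimal sketch of the two sampling schemes (added for illustration, not from the slides), drawing one training subset of 100 instances with and without replacement; the subset size and the training-set size of 375 (the X_train size from the earlier moons example) are only illustrative.

import numpy as np

rng = np.random.default_rng(42)
n_train = 375       # size of X_train in the earlier moons example
subset_size = 100   # illustrative subset size per predictor

# Bagging: indices drawn WITH replacement, so some instances can appear several times.
bag_idx = rng.choice(n_train, size=subset_size, replace=True)

# Pasting: indices drawn WITHOUT replacement, so every index is distinct.
paste_idx = rng.choice(n_train, size=subset_size, replace=False)

print(len(np.unique(bag_idx)), len(np.unique(paste_idx)))  # typically fewer than 100 unique vs. exactly 100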


Pasting/bagging training set sampling and training


Bootstrap (sampling with replacement)


After training

• The ensemble makes a prediction for a new instance by aggregating the predictions of all predictors.

• The aggregation function is typically the statistical mode for classification, or the average for regression (sketched below).

• Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.

• Generally, the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
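A minimal sketch of the aggregation step (added for illustration, not from the slides), assuming predictors is a list of already fitted estimators: the statistical mode for classification and the mean for regression.

import numpy as np
from scipy import stats

def aggregate_classification(predictors, X_new):
    # Statistical mode of the individual class predictions (hard voting).
    all_preds = np.array([clf.predict(X_new) for clf in predictors])
    return np.ravel(stats.mode(all_preds, axis=0).mode)

def aggregate_regression(predictors, X_new):
    # Average of the individual regression predictions.
    all_preds = np.array([reg.predict(X_new) for reg in predictors])
    return all_preds.mean(axis=0)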


Bagging in sklearn


sklearn.ensemble.BaggingClassifier

class sklearn.ensemble.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

A Bagging classifier.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting. If samples are drawn with replacement, then the method is known as Bagging. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches.

Read more in the User Guide.

Parameters:

base_estimator : object or None, optional (default=None)

The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.

n_estimators : int, optional (default=10)

The number of base estimators in the ensemble.

max_samples : int or float, optional (default=1.0)

The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.

max_features : int or float, optional (default=1.0)

The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.

bootstrap : boolean, optional (default=True)

Whether samples are drawn with replacement. If False, sampling without replacement is performed.

bootstrap_features : boolean, optional (default=False)

Whether features are drawn with replacement.


Example - Bagging

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [3]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.904

In [4]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

In [ ]:
fig = plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()
fig.savefig("DT_without_and_with_bagging.pdf", bbox_inches='tight')


A single Decision Tree versus a bagging ensemble of 500 trees

[Figure: decision boundaries on the moons data; left panel "Decision Tree", right panel "Decision Trees with Bagging"; axes x1 and x2.]


Comments

• The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method).

• The ensemble has a comparable bias but a smaller variance.

• Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting.

• However, this also means that predictors end up being less correlated, so the ensemble's variance is reduced.

• Overall, bagging is generally preferred.

• Cross-validation should be done to evaluate both bagging and pasting and select the one that works best (a sketch of such a comparison follows).
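A minimal sketch of such a comparison (added for illustration, not from the slides) using cross_val_score on the moons training set; the settings mirror the earlier bagging example and bootstrap simply toggles bagging versus pasting.

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def make_ensemble(bootstrap):
    # bootstrap=True -> bagging, bootstrap=False -> pasting
    return BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=500,
        max_samples=100, bootstrap=bootstrap, n_jobs=-1, random_state=42)

for name, bootstrap in (("bagging", True), ("pasting", False)):
    scores = cross_val_score(make_ensemble(bootstrap), X_train, y_train, cv=5)
    print(name, scores.mean())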


Example - Pasting

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [3]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.912

In [4]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856


Random Patches and Random Subspaces


Introduction

• The BaggingClassifier class also supports sampling the features. This is controlled by two hyper-parameters: max_features and bootstrap_features.

• Random Patches: sampling both training instances and features.

• Random Subspaces: keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0).

• Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.

• This is useful when dealing with high-dimensional inputs (such as images). Both variants are sketched below.
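A minimal sketch of the two variants with BaggingClassifier (added for illustration, not from the slides); the particular values of max_samples and max_features are arbitrary, and on the 2-feature moons data max_features=0.5 keeps a single feature per predictor.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Subspaces: keep all training instances, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=False, max_samples=1.0,           # all training instances
    bootstrap_features=True, max_features=0.5,  # random feature subsets
    n_jobs=-1, random_state=42)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, max_samples=0.75,           # random instance subsets
    bootstrap_features=True, max_features=0.5,  # random feature subsets
    n_jobs=-1, random_state=42)

subspaces_clf.fit(X_train, y_train)
patches_clf.fit(X_train, y_train)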


Random Forests


Introduction

• A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.

• Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, we can instead use the RandomForestClassifier and RandomForestRegressor classes (a small regression sketch follows).
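Since the slides mention the regression counterpart but the following example only uses classification, here is a minimal RandomForestRegressor sketch on a toy regression problem (added for illustration, not from the slides).

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy regression data, purely illustrative.
Xr, yr = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=42)

rnd_reg = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
rnd_reg.fit(Xr_train, yr_train)
print(rnd_reg.score(Xr_test, yr_test))  # R^2 on the held-out set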


Example

In [11]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [12]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [13]:
np.sum(y_pred == y_pred_rf) / len(y_pred)  # almost identical predictions

Out[13]: 0.976


Boosting


Introduction

• A boosting method trains predictors sequentially, each trying to correct its predecessor.

• The most popular boosting methods are AdaBoost (Adaptive Boosting) and Gradient Boosting.


1. AdaBoost

• One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted.

- A first base classifier is trained and used to make predictions on the training set.

- The relative weight of misclassified training instances is then increased.

- A second classifier is trained using the updated weights and again it makes predictions on the training set, the weights are updated, and so on (the weight-update rule is sketched below).
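For reference, the standard AdaBoost update behind this description, following the notation used in Géron's book (the slides do not spell out the formulas). With instance weights w^{(i)}, the weighted error rate of the j-th predictor, its predictor weight, and the instance-weight update (followed by normalisation of the weights) are:

r_j = \frac{\sum_{i:\, \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

\alpha_j = \eta \log \frac{1 - r_j}{r_j}

w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}

where \eta is the learning rate and m the number of training instances.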


Building consecutive predictors

from sklearn.svm import SVC

m = len(X_train)

fig = plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    plt.subplot(subplot)
    for i in range(5):
        svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate), fontsize=16)
    if subplot == 121:
        plt.text(-0.7, -0.50, "1", fontsize=14)
        plt.text(-0.6, -0.10, "2", fontsize=14)
        plt.text(-0.5, 0.30, "3", fontsize=14)
        plt.text(-0.4, 0.55, "4", fontsize=14)
        plt.text(-0.3, 1.00, "5", fontsize=14)

plt.show()


Decision boundaries of consecutive predictors

[Figure: decision boundaries of the five consecutive SVC predictors on the moons data; left panel learning_rate = 1 (boundaries labelled 1 to 5), right panel learning_rate = 0.5; axes x1 and x2.]

• The plot on the right represents the same sequence of predictors, except that the learning rate is halved.

• This means the misclassified instance weights are boosted half as much at every iteration.


AdaBoostClassifier

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)

ada_clf.fit(X_train, y_train)
plot_decision_boundary(ada_clf, X, y)

[Figure: decision boundary of the AdaBoost ensemble on the moons data; axes x1 and x2.]


2. Gradient Boosting

• Combines multiple decision trees to create a more powerful model.

• These models can be used for regression and classification.

• Works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one (see the residual-fitting sketch below).

• Strong pre-pruning is used.

• Often uses very shallow trees, of depth 1 to 5.

• Besides the pre-pruning and the number of trees in the ensemble, another important parameter is the learning_rate, which controls how strongly each tree tries to correct the mistakes of the previous trees.
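A minimal sketch of the "each tree corrects the previous one" idea (added for illustration; it follows the regression example in Géron's book rather than the slides), fitting each new tree to the residual errors of the current ensemble.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data, purely illustrative.
rng = np.random.default_rng(42)
Xg = rng.uniform(-0.5, 0.5, size=(100, 1))
yg = 3 * Xg[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# The first tree fits the targets; each following tree fits the residuals.
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(Xg, yg)

y2 = yg - tree_reg1.predict(Xg)              # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(Xg, y2)

y3 = y2 - tree_reg2.predict(Xg)              # residuals of the first two trees
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(Xg, y3)

# The ensemble predicts by summing the predictions of all the trees.
X_new = np.array([[0.2]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))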


Example

In [11]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

In [5]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 1.000
Accuracy on test set: 0.965

In [6]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 0.991
Accuracy on test set: 0.972

In [7]:
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 0.988
Accuracy on test set: 0.965


Feature importances

In [ ]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

fig = plt.figure(figsize=(8, 6))
n_features = cancer.data.shape[1]
plt.barh(np.arange(n_features), gbrt.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features)
fig.savefig("gbc_cancer.pdf", bbox_inches='tight')

[Figure: horizontal bar chart of the gradient boosting feature importances for the 30 breast cancer features (the mean, error and worst values of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension); x-axis "Feature importance", y-axis "Feature".]


More about Gradient Boosting

• Gradient boosted decision trees are among the most powerful and widely used models for supervised learning.

• They require careful tuning of the parameters and may take a long time to train.

• The algorithm works well without scaling and on a mixture of binary and continuous features.

• It often does not work well on high-dimensional sparse data.

• n_estimators and learning_rate are interconnected (a lower learning_rate means that more trees are needed to build a model of similar complexity); see the sketch below.

• In contrast to random forests, where a higher n_estimators value is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting.

• Another important parameter is max_depth (or alternatively max_leaf_nodes), to reduce the complexity of each tree.
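A minimal sketch of the n_estimators / learning_rate interplay on the breast cancer data from the earlier example (added for illustration, not from the slides; the two settings are arbitrary and untuned): fewer trees with a larger step versus more trees with a smaller step.

from sklearn.ensemble import GradientBoostingClassifier

for n_estimators, learning_rate in ((100, 0.1), (1000, 0.01)):
    gbrt = GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate,
        max_depth=1, random_state=0)
    gbrt.fit(X_train, y_train)
    print(n_estimators, learning_rate, gbrt.score(X_test, y_test))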


Questions?
