Today’s Topics

• Ensembles

• Decision Forests (actually, Random Forests)

• Bagging and Boosting

• Decision Stumps

9/22/15 CS 540 - Fall 2015 (© Jude Shavlik), Lecture 7, Week 3


Ensembles (Bagging, Boosting, and all that)

Old view: learn one good model

New view: learn a good set of models (Naïve Bayes, k-NN, neural net, d-tree, SVM, etc.)

Probably the best example of the interplay between 'theory & practice' in machine learning


Ensembles of Neural Networks (or any supervised learner)

• Ensembles often produce accuracy gains of 5-10 percentage points!

• Can combine "classifiers" of various types
– E.g., decision trees, rule sets, neural networks, etc.

[Figure: the INPUT feeds several networks in parallel; a Combiner merges their predictions into the final OUTPUT]


Three Explanations of Why Ensembles Help


1. Statistical (sample effects)

2. Computational (limited cycles for search)

3. Representational (wrong hypothesis space)

From: Dietterich, T. G. (2002). Ensemble learning. In The Handbook of Brain Theory and Neural Networks, 2nd ed. (M. A. Arbib, Ed.). Cambridge, MA: The MIT Press, pp. 405-408.

[Figure: the concept space considered in each case; key: true concept, learned models, search path]


Combining Multiple Models
Three ideas for combining predictions

1. Simple (unweighted) votes
• Standard choice

2. Weighted votes
• E.g., weight by tuning-set accuracy (both voting schemes are sketched below)

3. Learn a combining function
• Prone to overfitting?
• 'Stacked generalization' (Wolpert)
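A minimal sketch of the first two voting schemes (an illustration added here, not from the slides), assuming each ensemble member exposes a hypothetical predict(x) method that returns a class label, and that the optional weights come from something like tuning-set accuracies:

```python
from collections import defaultdict

def combine_votes(models, x, weights=None):
    """Combine one prediction per ensemble member by (weighted) voting.

    models  -- trained classifiers, each with a predict(x) -> label method
               (an assumed interface, not from the slides)
    weights -- optional per-model weights, e.g. tuning-set accuracies;
               None means every model gets weight 1 (simple unweighted vote)
    """
    if weights is None:
        weights = [1.0] * len(models)
    tally = defaultdict(float)
    for model, w in zip(models, weights):
        tally[model.predict(x)] += w
    # Return the label with the largest total (weighted) vote
    return max(tally, key=tally.get)
```

The third idea, stacking, would instead train a second-level learner on the members' outputs rather than hand-coding the combination.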


Random Forests (Breiman, Machine Learning, 2001; related to Ho, 1995)

A variant of something called BAGGING (‘multi-sets’)

Algorithm

Let N = # of examples, F = # of features, i = some number << F

Repeat k times:

(1) Draw, with replacement, N examples and put them in the train set

(2) Build a d-tree, but in each recursive call
– Choose (w/o replacement) i features
– Choose the best of these i as the root of this (sub)tree

(3) Do NOT prune

In HW2, we'll give you 101 'bootstrapped' samples of the Thoracic Surgery Dataset
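A minimal sketch of this training loop (assumptions: examples are (feature_dict, label) pairs, and build_unpruned_tree is a hypothetical helper that grows a decision tree without pruning, calling the supplied sampler at every recursive split to get its i candidate features):

```python
import random

def train_random_forest(examples, k, i, build_unpruned_tree):
    """Grow k unpruned trees, each on a bootstrap sample, splitting on i random features.

    examples            -- list of N (feature_dict, label) pairs
    k                   -- number of trees to grow
    i                   -- number of features (<< F) considered at each recursive call
    build_unpruned_tree -- hypothetical helper (not from the slides): builds a d-tree
                           WITHOUT pruning, asking feature_sampler() at every split
                           for the candidate features to score
    """
    N = len(examples)
    all_features = list(examples[0][0].keys())   # the F feature names

    def feature_sampler():
        # Step (2): choose i features without replacement at each recursive call
        return random.sample(all_features, i)

    forest = []
    for _ in range(k):
        # Step (1): draw N examples WITH replacement (a bootstrap sample)
        bootstrap = [random.choice(examples) for _ in range(N)]
        # Steps (2)+(3): grow a full, unpruned tree on the bootstrap sample
        forest.append(build_unpruned_tree(bootstrap, feature_sampler))
    return forest
```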


Using Random Forests

After training we have K decision trees

How to use on TEST examples?
Some variant of: if at least L of these K trees say 'true', then output 'true'

How to choose L?
Use a tune set to decide
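A minimal sketch of this test-time rule and of choosing L on a tuning set, assuming binary 'true'/'false' labels and trees that expose a hypothetical predict(example) method:

```python
def forest_predicts_true(forest, example, L):
    """Output True iff at least L of the K trees say 'true'."""
    votes = sum(1 for tree in forest if tree.predict(example) == 'true')
    return votes >= L

def choose_L(forest, tune_set):
    """Pick the vote threshold L that maximizes accuracy on a held-out tuning set.

    tune_set -- list of (example, label) pairs not used for training
    """
    best_L, best_acc = 1, -1.0
    for L in range(1, len(forest) + 1):
        correct = sum(1 for ex, label in tune_set
                      if forest_predicts_true(forest, ex, L) == (label == 'true'))
        acc = correct / len(tune_set)
        if acc > best_acc:
            best_L, best_acc = L, acc
    return best_L
```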


More on Random Forests

• Increasing i
– Increases correlation among individual trees (BAD)
– Also increases accuracy of individual trees (GOOD)

• Can also use a tuning set to choose a good value for i (a small sketch follows below)

• Overall, random forests
– Are very fast (e.g., 50K examples, 10 features, 10 trees/min on a 1 GHz CPU back in 2004)
– Deal well with a large # of features
– Reduce overfitting substantially; NO NEED TO PRUNE!
– Work very well in practice
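The same tuning-set idea can pick i. A minimal sketch, reusing the train_random_forest, forest_predicts_true, and choose_L sketches from the previous two slides; the candidate list is a hypothetical choice:

```python
def choose_i(train_set, tune_set, k, build_unpruned_tree, candidates=(1, 2, 3, 5)):
    """Pick the per-split feature-sample size i by forest accuracy on the tuning set."""
    best_i, best_acc = None, -1.0
    for i in candidates:
        forest = train_random_forest(train_set, k, i, build_unpruned_tree)
        L = choose_L(forest, tune_set)
        correct = sum(1 for ex, label in tune_set
                      if forest_predicts_true(forest, ex, L) == (label == 'true'))
        acc = correct / len(tune_set)
        if acc > best_acc:
            best_i, best_acc = i, acc
    return best_i
```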


A Relevant Early Paper on ENSEMBLES
Hansen & Salamon, IEEE PAMI 12, 1990

– If (a) the combined predictors have errors that are independent of one another

– And (b) the probability that any given model correctly predicts any given test-set example is > 50%, then

$\lim_{N \to \infty} \big(\text{test-set error rate of an ensemble of } N \text{ predictors}\big) = 0$
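One concrete way to see this, added here for illustration: if the N predictors err independently and each is correct with probability p > 0.5, the chance that a majority vote is wrong is a binomial tail that shrinks toward 0 as N grows. A small computation (assuming odd N to avoid ties):

```python
from math import comb

def majority_vote_error(N, p):
    """P(majority of N independent predictors is wrong), each correct with prob p.

    Sums the binomial tail where at most N//2 predictors are correct (exact for odd N).
    For p > 0.5 this probability shrinks toward 0 as N grows, which restates the
    Hansen & Salamon observation.
    """
    return sum(comb(N, j) * p**j * (1 - p)**(N - j) for j in range(0, N // 2 + 1))

# Example: individually mediocre (60%-accurate) but independent predictors
for N in (1, 11, 101):
    print(N, round(majority_vote_error(N, 0.6), 4))
```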


Some More Relevant Early Papers

• Schapire, Machine Learning 5, 1990 ('Boosting')
– If you have an algorithm that gets > 50% accuracy on any distribution of examples, you can create an algorithm that gets > (100% - ε) accuracy, for any ε > 0
– Need an infinite (or very large, at least) source of examples
– Later extensions (e.g., AdaBoost) address this weakness

• Also see Wolpert, 'Stacked Generalization,' Neural Networks, 1992


Some Methods for Producing ‘Uncorrelated’ Members of an Ensemble

• K times randomly choose (with replacement) N examples from a training set of size N

• Give each training set to a std ML algo
– 'Bagging' by Breiman (Machine Learning, 1996)
– Want unstable algorithms (so learned models vary)

• Reweight examples each cycle (if wrong, increase weight; else decrease weight); a simplified sketch follows below
– 'AdaBoosting' by Freund & Schapire (1995, 1996)
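A minimal sketch of that reweighting step, simplified from AdaBoost (real AdaBoost derives the up/down-weighting factor from the round's weighted error; the fixed factor here is a hypothetical stand-in):

```python
def reweight(examples, weights, model, factor=2.0):
    """One round of 'if wrong, increase weight; else decrease weight'.

    examples -- list of (x, label) pairs
    weights  -- current per-example weights, same length as examples
    model    -- the classifier learned this round, with predict(x) (assumed interface)
    factor   -- fixed multiplier; AdaBoost computes this from the weighted error
    """
    new_weights = []
    for (x, label), w in zip(examples, weights):
        if model.predict(x) != label:
            new_weights.append(w * factor)   # misclassified: boost its weight
        else:
            new_weights.append(w / factor)   # correct: shrink its weight
    total = sum(new_weights)
    return [w / total for w in new_weights]  # renormalize so weights sum to 1
```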


Empirical Studies (from Freund & Schapire; reprinted in Dietterich’s AI Magazine paper)

[Scatter plots, each point one data set: error rate of C4.5 (an ID3 successor) vs. error rate of bagged (boosted) C4.5, and error rate of AdaBoost vs. error rate of bagging]

Boosting and Bagging helped almost always!

On average, Boosting slightly better?


Some More Methods for Producing “Uncorrelated” Members of an Ensemble

• Directly optimize accuracy + diversity
– Opitz & Shavlik (1995; used genetic algorithms)
– Melville & Mooney (2004-5)

• Different number of hidden units in a neural network, different k in k-NN, tie-breaking scheme, example ordering, different ML algorithms, etc.
– Various people
– See the 2005-2008 papers of Rich Caruana's group for large-scale empirical studies of ensembles


Boosting/Bagging/etc. Wrapup

• An easy-to-use and usually highly effective technique
– Always consider it (Bagging, at least) when applying ML to practical problems

• Does reduce 'comprehensibility' of models
– See work by Craven & Shavlik, though ('rule extraction')

• Increases runtime, but cycles are usually much cheaper than examples (and easily parallelized)


Decision "Stumps" (formerly part of HW; try on your own!)

• Holte (ML journal) compared:
– Decision trees with only one decision (decision stumps)
vs
– Trees produced by C4.5 (with its pruning algorithm used)

• Decision 'stumps' do remarkably well on UC Irvine data sets
– Archive too easy? Some datasets seem to be

• Decision stumps are a 'quick and dirty' control for comparing to new algorithms (a 1R-style sketch follows below)
– But ID3/C4.5 is easy to use and probably a better control
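A minimal sketch of a one-decision stump over nominal features, loosely in the spirit of Holte's 1R (his 1R also bins numeric features, which is omitted here): keep the single feature whose per-value majority-class rule makes the fewest training errors.

```python
from collections import Counter, defaultdict

def learn_stump(examples):
    """examples -- list of (feature_dict, label) pairs with nominal feature values."""
    best = None   # (training_errors, feature, {value: predicted_label})
    for feature in examples[0][0]:
        # Majority class for each value this feature takes in the training set
        by_value = defaultdict(Counter)
        for x, label in examples:
            by_value[x[feature]][label] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(1 for x, label in examples if rule[x[feature]] != label)
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    _, feature, rule = best
    return feature, rule

def stump_predict(stump, x, default=None):
    """Predict with a learned stump; unseen feature values fall back to a default."""
    feature, rule = stump
    return rule.get(x[feature], default)
```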


C4.5 Compared to 1R (‘Decision Stumps’)

See the Holte paper in Machine Learning for the key (e.g., HD = heart disease)

Test-set accuracy:

Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%    68.7%
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU        100.0%   98.4%
SE        97.7%    95.0%
SO        97.5%    81.0%
VO        95.6%    95.2%
V1        89.4%    86.8%
