18/12/2019
MALIS: Decision trees
Maria A. Zuluaga
Data Science Department
Motivation: k-NN
MALIS 2019 2
• A new test point arrives
• What to do?
• Anything wrong with that?
source: scikit-learn
18/12/2019
2
Motivation: Improving k-NN
• Can you think of a way of making k-NN faster?
Decision trees: an example
Name              Mask  Cape  Tie  Pointy ears  Smokes  Height  Class
Batman             Y     Y     N       Y           N     185    Good
Robin              Y     Y     N       Y           N     176    Good
Alfred             N     N     Y       N           N     180    Good
Penguin            N     N     Y       N           Y     145    Evil
Catwoman           Y     N     N       Y           N     170    Evil
Joker              N     N     N       N           N     177    Evil
Batgirl            Y     Y     N       Y           N     165    ?
Riddler            Y     Y     N       N           N     178    ?
Your future boss   N     Y     Y       Y           Y     177    ?
Adapted from CMU ML course
Decision trees

A first tree, built by hand on the table above:

Smokes?
├─ Y → Evil
└─ N → Mask?
        ├─ Y → Height > 175?
        │        ├─ Y → Good
        │        └─ N → Evil
        └─ N → Pointy ears?
                 ├─ Y → Good
                 └─ N → Evil

Does it classify the training data well?
Can we fix this?
Quiz: What is the smallest tree?
Smallest tree
Cape?
├─ Y → Good
└─ N → Height > 178?
         ├─ Y → Good
         └─ N → Evil

This tree classifies all six training examples correctly.
Decision trees
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
• Resort to a greedy approach:
• Start from empty decision tree
• Split on next best feature
• Recurse
• Goal: Build a maximally compact tree with only pure leaves
Impurity functions
• Greedy strategy: We keep splitting the data to minimize an impurity function that measures label purity amongst the children.
• Definitions:
  • Data S = {(x₁, y₁), …, (xₙ, yₙ)}, y ∈ {1, …, K}
  • K: total number of classes
  • S_k ⊆ S, where S_k = {(x, y) ∈ S : y = k}
  • S = S₁ ∪ S₂ ∪ ⋯ ∪ S_K
  • p_k = |S_k| / |S| ← fraction of inputs in S with label k
• Gini impurity:
  G(S) = Σ_{k=1}^{K} p_k (1 − p_k)
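As a minimal sketch, the Gini impurity can be computed directly from the class fractions p_k (pure Python; the function name `gini` is ours):

```python
def gini(labels):
    """Gini impurity G(S) = sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return sum(c / n * (1 - c / n) for c in counts.values())

print(gini(["Good", "Good", "Evil", "Evil"]))  # → 0.5, maximally impure for K = 2
print(gini(["Good", "Good", "Good"]))          # → 0.0, a pure leaf
```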
Gini impurity
• If K = 2: G(S) = p(1 − p) + (1 − p)p = 2p(1 − p)

[Figure: plot of G(p) against p, a parabola peaking at p = 0.5]
Entropy
• What is the worst-case scenario for a leaf of the tree? A uniform label distribution.
• Kullback-Leibler (KL) divergence:
  KL(p ‖ q) = Σ_{k=1}^{K} p_k log (p_k / q_k)
• Measure how far p is from the uniform distribution q_k = 1/K:
  KL(p ‖ q) = Σ_k p_k log p_k − Σ_k p_k log (1/K)
            = Σ_k p_k log p_k + log K · Σ_k p_k
            = Σ_k p_k log p_k + log K
• Since log K is a constant:
  max_p KL(p ‖ q) = max_p Σ_k p_k log p_k
• Now remember, we prefer minimization to maximization, so:
  max_p KL(p ‖ q) = min_p − Σ_k p_k log p_k = min_p H(p)
  where H(p) is the entropy.
Entropy of a tree
S is split into a left child S_L and a right child S_R:
  p_L = |S_L| / |S|,  p_R = |S_R| / |S|
  H(S) = p_L H(S_L) + p_R H(S_R)
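The weighted-child entropy above can be sketched as (helper names are ours):

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split_entropy(left, right):
    """H(S) after a split: p_L * H(S_L) + p_R * H(S_R)."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# A pure split drives the weighted entropy to zero:
print(split_entropy(["Good", "Good"], ["Evil", "Evil", "Evil"]))  # → 0.0
```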
The algorithm
• Compute the impurity function for every attribute of dataset S
• Split the set S into subsets using the attribute for which the resulting impurity function is minimal after splitting (greedy approach)
• Make a decision tree node containing that attribute
• Recurse on each subset using the remaining attributes
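One greedy step of the algorithm can be sketched on the binary attributes of the table (Height is left out for simplicity; the encoding and helper names are our own). Consistent with the smallest-tree slide, Cape comes out as the best first split:

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Rows of the slide's table, binary attributes only (our own encoding):
# (Mask, Cape, Tie, Pointy ears, Smokes, Class)
data = [
    ("Y", "Y", "N", "Y", "N", "Good"),  # Batman
    ("Y", "Y", "N", "Y", "N", "Good"),  # Robin
    ("N", "N", "Y", "N", "N", "Good"),  # Alfred
    ("N", "N", "Y", "N", "Y", "Evil"),  # Penguin
    ("Y", "N", "N", "Y", "N", "Evil"),  # Catwoman
    ("N", "N", "N", "N", "N", "Evil"),  # Joker
]
attributes = ["Mask", "Cape", "Tie", "Pointy ears", "Smokes"]

def split_score(rows, j):
    """Size-weighted entropy of the children after splitting on attribute j."""
    n = len(rows)
    score = 0.0
    for v in ("Y", "N"):
        part = [r[-1] for r in rows if r[j] == v]
        if part:
            score += len(part) / n * entropy(part)
    return score

best = min(range(len(attributes)), key=lambda j: split_score(data, j))
print(attributes[best])  # → Cape
```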
How do we stop?
Stopping criteria (ID3 algorithm)
1. All data points in the subset have the same output
2. There are no more features to consider for the split
Demo: 05_trees.ipynb
What if no split improves impurity? The XOR
• Two features: a, b; the label is a XOR b
• H(S) = 1
• First round:
  • Use a for the split → impurity: 1
  • Use b for the split → impurity: 1

[Figure: the four XOR points on the (x1, x2) grid]

• The first split does not improve things
• Decision trees suffer from myopia
• Alternative: tree pruning
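A small check of the XOR pathology (variable names are ours): the entropy before and after splitting on either feature is unchanged:

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# XOR: the label is a ^ b, so neither feature alone carries any information.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
h_before = entropy([y for _, _, y in rows])

for j, name in ((0, "a"), (1, "b")):
    children = [[r[2] for r in rows if r[j] == v] for v in (0, 1)]
    h_after = sum(len(c) / len(rows) * entropy(c) for c in children)
    print(name, h_before, h_after)  # 1.0 before, 1.0 after: no improvement
```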
The XOR
Demo: 05_trees.ipynb
Regression trees
• CART: Classification and regression trees
• Change of impurity function:
  L(S) = (1/|S|) Σ_{(x,y)∈S} (y − ȳ_S)²
  where ȳ_S is the average y in the set S.
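A sketch of the regression impurity above (the function name `squared_loss` is ours):

```python
def squared_loss(ys):
    """Regression impurity: mean squared deviation from the leaf average."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

print(squared_loss([2.0, 4.0]))  # → 1.0: predicting the mean 3.0 errs by 1 on each
print(squared_loss([5.0, 5.0]))  # → 0.0: a constant leaf is pure
```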
Overfitting
Demo: 05_trees.ipynb
How to fix it?
• Must use tricks to find “simple trees”
• Fixed depth/Early stopping
• Pruning
• Use ensembles of different trees:
• Random forests
Parametric vs. non-parametric
• So far:
  • Generative vs. discriminative
  • Probabilistic vs. non-probabilistic
• Parametric algorithm: Has a constant set of parameters which is independent of the number of training samples
• Examples: perceptron, logistic regression
• The dimension of w depends on the dimension of the training data but not on the number of samples
• Non-parametric algorithm: Model size scales as a function of the number of training samples
• Example: K-NN
Decision trees: Parametric or not?
• Non-parametric: if trained to full depth
  • The depth of a decision tree scales as a function of the training data: O(log₂ n)
• Parametric: If the tree depth is limited by a maximum
• Upper bound of the model size is known prior to observing the training data
Exercise: Classify other algorithms covered as parametric or non-parametric
MALIS: Ensembles
Maria A. Zuluaga
Data Science Department
[Figure: the bias-variance tradeoff] Source: https://djsaunde.wordpress.com/2017/07/17/the-bias-variance-tradeoff/
Reducing variance without increasing bias
• We want to reduce the variance
• For that matter: make h_D(x) → h̄(x)
• How? Averaging

Recall the decomposition of the expected error:
E[(h_D(x) − y)²] = E[(h_D(x) − h̄(x))²] + E[(ȳ(x) − y)²] + E[(h̄(x) − ȳ(x))²]
      error            variance              noise             bias²
Weak law of large numbers
• Roughly, it says that for i.i.d. random variables x_i with mean x̄:
  (1/m) Σ_{i=1}^{m} x_i → x̄  as m → ∞
• Apply this to an ensemble of classifiers:
  1. m datasets D₁, …, D_m available, drawn from Pⁿ
  2. Train a classifier on each dataset and then average:
     ĥ = (1/m) Σ_{i=1}^{m} h_{D_i} → h̄  as m → ∞
• Problem? We only ever have one dataset D.
Solution: Bagging (Bootstrap aggregating)
• Leo Breiman (1994)
• Idea: Take repeated bootstrap samples from the training set D.
• Bootstrap sampling: Given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Algorithm:
  1. Create k bootstrap samples D₁, …, D_k
  2. Train a classifier on each D_i
  3. Classify a new instance by majority vote or average
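The three steps above can be sketched with a deliberately trivial base learner (all names are ours; a real implementation would train decision trees on each bootstrap sample):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw |data| examples at random with replacement."""
    return [rng.choice(data) for _ in data]

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stub(sample):
    """Deliberately weak base learner: always predict the majority class
    of its bootstrap sample, ignoring the features entirely."""
    label = majority_label([y for _, y in sample])
    return lambda x: label

# Toy training set: (height, class), as in the slide's table.
data = [(185, "Good"), (176, "Good"), (180, "Good"),
        (145, "Evil"), (170, "Evil"), (177, "Evil")]

rng = random.Random(0)
classifiers = [train_stub(bootstrap_sample(data, rng)) for _ in range(25)]

# Classify a new instance by majority vote over the ensemble.
votes = [h(160) for h in classifiers]
print(majority_label(votes))
```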
Parenthesis: Bootstrapping
• In statistics, bootstrapping is any test or metric that relies on random sampling with replacement.
• Bootstrap as a metaphor, meaning to better oneself by one's own unaided efforts, was in use in 1922. This metaphor spawned additional metaphors for a series of self-sustaining processes that proceed without external help.
Source: Wikipedia
“Pull yourself up by your own bootstraps”
Source: CSAIL ML course
Analysis
• Notice that in bagging ĥ = (1/m) Σ_{i=1}^{m} h_{D_i}(x) ↛ h̄(x)
• Question: Why?
• The weak law of large numbers cannot be applied: the bootstrap samples are not independent draws
• Still, bagging is efficient at reducing the variance
• Also, although it cannot be proven that the new samples are i.i.d., it can be proven that they are drawn from the original distribution P (exercise)
Bagging
Advantages
• Easy to implement
• Reduces variance while keeping the bias unaltered
• As the prediction is the result of averaging many classifiers, one obtains a score and a variance that can be interpreted as uncertainty
• Out of bag error
Disadvantages
• Computationally more expensive
• Correlated training sets
Out of bag error
• Bagging provides an unbiased estimate of the test error
Out of bag error: Formalization
• For each (x_i, y_i) ∈ D, let us define T_i as the set of all training sets D_k which do not contain (x_i, y_i):
  T_i = { k | (x_i, y_i) ∉ D_k }
• Let the averaged classifier over all these datasets be:
  h̃_i(x) = (1/|T_i|) Σ_{k ∈ T_i} h_{D_k}(x)
• The out-of-bag error becomes:
  ε_OOB = (1/n) Σ_{(x_i, y_i) ∈ D} ℓ(h̃_i(x_i), y_i)
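A sketch of ε_OOB using index-based bootstrap samples and a hypothetical 1-nearest-neighbour base learner (all names are ours):

```python
import random
from collections import Counter

rng = random.Random(1)
data = [(x, "Good" if x > 175 else "Evil") for x in range(140, 200, 5)]
n, m = len(data), 30

# Keep bootstrap samples as index lists so we know which points each missed.
samples = [[rng.randrange(n) for _ in range(n)] for _ in range(m)]

def train_nn(idxs):
    """Hypothetical base learner: 1-nearest-neighbour on the bootstrap sample."""
    sample = [data[i] for i in idxs]
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

classifiers = [train_nn(idxs) for idxs in samples]

errors, counted = 0, 0
for i, (x, y) in enumerate(data):
    # Only classifiers whose bootstrap set does not contain (x_i, y_i) vote.
    votes = [classifiers[k](x) for k in range(m) if i not in samples[k]]
    if votes:
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y:
            errors += 1
print(errors / counted)  # the out-of-bag error estimate
```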
Random Forests
• Ensemble method specifically designed for decision tree classifiers
• It introduces two sources of randomness:
1. Bagging
2. Random input vectors
• Bagging: Each tree is grown using a bootstrap sample of the training data
• Random vector method: At each node, best split is chosen from a random sample of m attributes instead of all of them
• Alleviates correlation among bootstrap samples
Algorithm
1. Sample m datasets D₁, …, D_m from D with replacement
2. For each D_i, train a full decision tree h_i (max_depth = ∞) with a small modification:
   • Before each split, randomly select k < d features (without replacement)
   • Consider only these features for the split (increases the variance of the trees)
3. Final classifier:
   h(x) = (1/m) Σ_{i=1}^{m} h_i(x)
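A toy version of the algorithm, with depth-1 trees standing in for full trees to keep the sketch short (the data encoding and all names are our own):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Binary encoding of the slide's rows: (Mask, Cape, Tie, Ears, Smokes, Class).
data = [
    (1, 1, 0, 1, 0, "Good"), (1, 1, 0, 1, 0, "Good"), (0, 0, 1, 0, 0, "Good"),
    (0, 0, 1, 0, 1, "Evil"), (1, 0, 0, 1, 0, "Evil"), (0, 0, 0, 0, 0, "Evil"),
]
d, k, m = 5, 2, 15  # d features in total, k considered per split, m trees
rng = random.Random(0)

def train_stump(sample):
    """One depth-1 'tree': best split among k randomly chosen features."""
    feats = rng.sample(range(d), k)  # k < d, drawn without replacement

    def score(j):
        parts = [[r[-1] for r in sample if r[j] == v] for v in (0, 1)]
        return sum(len(p) / len(sample) * entropy(p) for p in parts if p)

    j = min(feats, key=score)
    leaves = {v: Counter(r[-1] for r in sample if r[j] == v) for v in (0, 1)}
    default = Counter(r[-1] for r in sample).most_common(1)[0][0]
    return lambda x: (leaves[x[j]].most_common(1)[0][0]
                      if leaves[x[j]] else default)

# Each stump is grown on a bootstrap sample; predictions are majority-voted.
forest = [train_stump([rng.choice(data) for _ in data]) for _ in range(m)]

def predict(x):
    return Counter(h(x) for h in forest).most_common(1)[0][0]

print(predict((1, 1, 0, 1, 0)))  # a Batman-like row
```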
[Figure] Source: Hastie et al., ESL
Tips
• One of the best, most popular, and easiest-to-use out-of-the-box classifiers. Two reasons for this:
  1. The RF has only two hyper-parameters, m and k, and it is quite insensitive to both. A good choice is:
     • For k: k = √d
     • For m: as large as possible
  2. Decision trees do not require a lot of preprocessing.
     • The features can be of different scale, magnitude, or slope.
     • Advantageous in scenarios with heterogeneous data, recorded in completely different units
Boosting
• Another ensemble-of-classifiers technique
• Scenario: a hypothesis class ℍ whose classifiers have a large bias and a high training error
• Question: Can weak learners h be combined to generate a strong learner? (Michael Kearns, Prof. UPenn, in his ML class project, 1988)
• Answer: Yes – Robert Schapire in 1990
• Weak learner: One whose error rate is only slightly better than random guessing.
• Strong learner: One who is arbitrarily well-correlated with the true classification
Boosting: How to?
• Create an ensemble classifier H(x) = Σ_{t=1}^{T} α_t h_t(x) in an iterative fashion
• At each iteration t, the classifier α_t h_t(x) is added to the ensemble
• At test time, all classifiers are evaluated and return the weighted sum
• The process is similar to gradient descent: instead of updating model parameters at each iteration, functions are added to the ensemble
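A sketch of the additive structure only: the α_t and h_t below are hand-picked for illustration, whereas AdaBoost would fit each one from the weighted training error (all names are ours):

```python
def predict(ensemble, x):
    """H(x) = sum_t alpha_t * h_t(x); the sign gives the class in {-1, +1}."""
    return sum(alpha * h(x) for alpha, h in ensemble)

# Hand-picked weighted stumps on a 1-D toy problem (labels in {-1, +1}).
stumps = [
    (0.9, lambda x: 1 if x > 2 else -1),
    (0.4, lambda x: 1 if x > 5 else -1),
    (0.2, lambda x: 1 if x <= 8 else -1),
]

ensemble = []
for alpha, h in stumps:  # one weighted function added per iteration
    ensemble.append((alpha, h))

print(1 if predict(ensemble, 6.0) > 0 else -1)  # → 1
```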
Schematic view of AdaBoost
Source: Hastie et al, ESL
err < 0.5
Gradient boosting trees
Final note: Boosting, Kaggle competitions & reproducibility
[Figure] Source: Hastie et al., ESL
MARS: Multivariate Adaptive Regression Splines
Recap
• We have covered a large set of supervised learning methods
• In this last lecture:
• Trees
• Ensemble methods
• Some of the best out of the box methods
• We have identified some of the problems these methods can present
• And presented ways to address them
Further reading and useful material
Source                            Notes
Elements of Statistical Learning  Ch. 9, 10, 15, 16