18/12/2019
MALIS: Decision trees
Maria A. Zuluaga
Data Science Department
Motivation: k-NN
MALIS 2019 2
• A new test point arrives
• What to do?
• Anything wrong with that?
source: scikit-learn
18/12/2019
2
Motivation: Improving k-NN
• Can you think of a way of making k-NN faster?
Decision trees: an example
Name              Mask  Cape  Tie  Pointy ears  Smokes  Height  Class
Batman             Y     Y     N       Y           N     185    Good
Robin              Y     Y     N       Y           N     176    Good
Alfred             N     N     Y       N           N     180    Good
Penguin            N     N     Y       N           Y     145    Evil
Catwoman           Y     N     N       Y           N     170    Evil
Joker              N     N     N       N           N     177    Evil
Batgirl            Y     Y     N       Y           N     165    ?
Riddler            Y     Y     N       N           N     178    ?
Your future boss   N     Y     Y       Y           Y     177    ?
Adapted from CMU ML course
Decision trees

A first tree, built by hand on the table above:

Smokes?
├─ Y → Evil
└─ N → Mask?
        ├─ Y → Height > 175?
        │        ├─ Y → Good
        │        └─ N → Evil
        └─ N → Pointy ears?
                 ├─ Y → Good
                 └─ N → Evil

Does it classify the training data well?
Can we fix this?
Quiz: What is the smallest tree?
Smallest tree
Cape?
├─ Y → Good
└─ N → Height > 178?
         ├─ Y → Good
         └─ N → Evil

This tree classifies all six training examples correctly.
Decision trees
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
• Resort to a greedy approach:
• Start from empty decision tree
• Split on next best feature
• Recurse
• Goal: Build a maximally compact tree with only pure leaves
Impurity functions
• Greedy strategy: We keep splitting the data to minimize an impurity function that measures label purity amongst the children.
• Definitions:
  • Data S = {(x₁, y₁), …, (xₙ, yₙ)}, y ∈ {1, …, K}
  • K: total number of classes
  • S_k ⊆ S, where S_k = {(x, y) ∈ S : y = k}
  • S = S₁ ∪ S₂ ∪ ⋯ ∪ S_K
  • p_k = |S_k| / |S| ← fraction of inputs in S with label k
• Gini impurity:
  G(S) = Σ_{k=1}^{K} p_k (1 − p_k)
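As a minimal sketch, the Gini impurity can be computed directly from the class fractions p_k (pure Python; the function name `gini` is ours):

```python
def gini(labels):
    """Gini impurity G(S) = sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return sum(c / n * (1 - c / n) for c in counts.values())

print(gini(["Good", "Good", "Evil", "Evil"]))  # → 0.5, maximally impure for K = 2
print(gini(["Good", "Good", "Good"]))          # → 0.0, a pure leaf
```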
Gini impurity
• If K = 2: G(S) = p(1 − p) + (1 − p)p = 2p(1 − p)

[Figure: plot of G(p) against p, a parabola peaking at p = 0.5]
Entropy
• What is the worst-case scenario for a leaf of the tree? A uniform label distribution.
• Kullback-Leibler (KL) divergence:
  KL(p ‖ q) = Σ_{k=1}^{K} p_k log (p_k / q_k)
• Measure how far p is from the uniform distribution q_k = 1/K:
  KL(p ‖ q) = Σ_k p_k log p_k − Σ_k p_k log (1/K)
            = Σ_k p_k log p_k + log K · Σ_k p_k
            = Σ_k p_k log p_k + log K
• Since log K is a constant:
  max_p KL(p ‖ q) = max_p Σ_k p_k log p_k
• Now remember, we prefer minimization to maximization, so:
  max_p KL(p ‖ q) = min_p − Σ_k p_k log p_k = min_p H(p)
  where H(p) is the entropy.
Entropy of a tree
S is split into a left child S_L and a right child S_R:
  p_L = |S_L| / |S|,  p_R = |S_R| / |S|
  H(S) = p_L H(S_L) + p_R H(S_R)
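The weighted-child entropy above can be sketched as (helper names are ours):

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split_entropy(left, right):
    """H(S) after a split: p_L * H(S_L) + p_R * H(S_R)."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# A pure split drives the weighted entropy to zero:
print(split_entropy(["Good", "Good"], ["Evil", "Evil", "Evil"]))  # → 0.0
```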
The algorithm
• Compute the impurity function for every attribute of dataset S
• Split the set S into subsets using the attribute for which the resulting impurity function is minimal after splitting (greedy approach)
• Make a decision tree node containing that attribute
• Recurse on each subset using the remaining attributes
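One greedy step of the algorithm can be sketched on the binary attributes of the table (Height is left out for simplicity; the encoding and helper names are our own). Consistent with the smallest-tree slide, Cape comes out as the best first split:

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Rows of the slide's table, binary attributes only (our own encoding):
# (Mask, Cape, Tie, Pointy ears, Smokes, Class)
data = [
    ("Y", "Y", "N", "Y", "N", "Good"),  # Batman
    ("Y", "Y", "N", "Y", "N", "Good"),  # Robin
    ("N", "N", "Y", "N", "N", "Good"),  # Alfred
    ("N", "N", "Y", "N", "Y", "Evil"),  # Penguin
    ("Y", "N", "N", "Y", "N", "Evil"),  # Catwoman
    ("N", "N", "N", "N", "N", "Evil"),  # Joker
]
attributes = ["Mask", "Cape", "Tie", "Pointy ears", "Smokes"]

def split_score(rows, j):
    """Size-weighted entropy of the children after splitting on attribute j."""
    n = len(rows)
    score = 0.0
    for v in ("Y", "N"):
        part = [r[-1] for r in rows if r[j] == v]
        if part:
            score += len(part) / n * entropy(part)
    return score

best = min(range(len(attributes)), key=lambda j: split_score(data, j))
print(attributes[best])  # → Cape
```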
How do we stop?
Stopping criteria (ID3 algorithm)
1. All data points in the subset have the same output
2. There are no more features to consider for the split
Demo: 05_trees.ipynb
What if no split improves impurity? The XOR
• Two features: a, b; the label is a XOR b
• H(S) = 1
• First round:
  • Use a for the split → impurity: 1
  • Use b for the split → impurity: 1

[Figure: the four XOR points on the (x1, x2) grid]

• The first split does not improve things
• Decision trees suffer from myopia
• Alternative: tree pruning
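A small check of the XOR pathology (variable names are ours): the entropy before and after splitting on either feature is unchanged:

```python
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# XOR: the label is a ^ b, so neither feature alone carries any information.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
h_before = entropy([y for _, _, y in rows])

for j, name in ((0, "a"), (1, "b")):
    children = [[r[2] for r in rows if r[j] == v] for v in (0, 1)]
    h_after = sum(len(c) / len(rows) * entropy(c) for c in children)
    print(name, h_before, h_after)  # 1.0 before, 1.0 after: no improvement
```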
The XOR
Demo: 05_trees.ipynb
Regression trees
• CART: Classification and regression trees
• Change of impurity function:
  L(S) = (1/|S|) Σ_{(x,y)∈S} (y − ȳ_S)²
  where ȳ_S is the average y in the set S.
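A sketch of the regression impurity above (the function name `squared_loss` is ours):

```python
def squared_loss(ys):
    """Regression impurity: mean squared deviation from the leaf average."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

print(squared_loss([2.0, 4.0]))  # → 1.0: predicting the mean 3.0 errs by 1 on each
print(squared_loss([5.0, 5.0]))  # → 0.0: a constant leaf is pure
```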
Overfitting
Demo: 05_trees.ipynb
How to fix it?
• Must use tricks to find “simple trees”
• Fixed depth/Early stopping
• Pruning
• Use ensembles of different trees:
• Random forests
Parametric vs. non-parametric
• So far:
  • Generative vs. discriminative
  • Probabilistic vs. non-probabilistic
• Parametric algorithm: Has a constant set of parameters which is independent of the number of training samples
• Examples: perceptron, logistic regression
• The dimension of w depends on the dimension of the training data but not on the number of samples
• Non-parametric algorithm: Model size scales as a function of the number of training samples
• Example: K-NN
Decision trees: Parametric or not?
• Non-parametric: if trained to full depth
  • The depth of a decision tree scales as a function of the training data: O(log₂ n)
• Parametric: If the tree depth is limited by a maximum
• Upper bound of the model size is known prior to observing the training data
Exercise: Classify other algorithms covered as parametric or non-parametric
MALIS: Ensembles
Maria A. Zuluaga
Data Science Department
[Figure: the bias-variance tradeoff] Source: https://djsaunde.wordpress.com/2017/07/17/the-bias-variance-tradeoff/
Reducing variance without increasing bias
• We want to reduce the variance
• For that matter: make h_D(x) → h̄(x)
• How? Averaging

Recall the decomposition of the expected error:
E[(h_D(x) − y)²] = E[(h_D(x) − h̄(x))²] + E[(ȳ(x) − y)²] + E[(h̄(x) − ȳ(x))²]
      error            variance              noise             bias²
Weak law of large numbers
• Roughly, it says that for i.i.d. random variables x_i with mean x̄:
  (1/m) Σ_{i=1}^{m} x_i → x̄  as m → ∞
• Apply this to an ensemble of classifiers:
  1. m datasets D₁, …, D_m available, drawn from Pⁿ
  2. Train a classifier on each dataset and then average:
     ĥ = (1/m) Σ_{i=1}^{m} h_{D_i} → h̄  as m → ∞
• Problem? We only ever have one dataset D.
Solution: Bagging (Bootstrap aggregating)
• Leo Breiman (1994)
• Idea: Take repeated bootstrap samples from the training set D.
• Bootstrap sampling: Given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Algorithm:
  1. Create k bootstrap samples D₁, …, D_k
  2. Train a classifier on each D_i
  3. Classify a new instance by majority vote or average
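The three steps above can be sketched with a deliberately trivial base learner (all names are ours; a real implementation would train decision trees on each bootstrap sample):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw |data| examples at random with replacement."""
    return [rng.choice(data) for _ in data]

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stub(sample):
    """Deliberately weak base learner: always predict the majority class
    of its bootstrap sample, ignoring the features entirely."""
    label = majority_label([y for _, y in sample])
    return lambda x: label

# Toy training set: (height, class), as in the slide's table.
data = [(185, "Good"), (176, "Good"), (180, "Good"),
        (145, "Evil"), (170, "Evil"), (177, "Evil")]

rng = random.Random(0)
classifiers = [train_stub(bootstrap_sample(data, rng)) for _ in range(25)]

# Classify a new instance by majority vote over the ensemble.
votes = [h(160) for h in classifiers]
print(majority_label(votes))
```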
Parenthesis: Bootstrapping
• In statistics, bootstrapping is any test or metric that relies on random sampling with replacement.
• Bootstrap as a metaphor, meaning to better oneself by one's own unaided efforts, was in use in 1922. This metaphor spawned additional metaphors for a series of self-sustaining processes that proceed without external help.
Source: Wikipedia
“Pull yourself up by your own bootstraps”
Source: CSAIL ML course
Analysis
• Notice that in bagging ĥ = (1/m) Σ_{i=1}^{m} h_{D_i}(x) ↛ h̄(x)
• Question: Why?
• The weak law of large numbers cannot be applied: the bootstrap samples are not independent draws
• Still, bagging is efficient at reducing the variance
• Also, although it cannot be proven that the new samples are i.i.d., it can be proven that they are drawn from the original distribution P (exercise)
Bagging
Advantages
• Easy to implement
• Reduces variance while keeping the bias unaltered
• As the prediction is the result of averaging many classifiers, one obtains a score and a variance that can be interpreted as uncertainty
• Out of bag error
Disadvantages
• Computationally more expensive
• Correlated training sets
Out of bag error
• Bagging provides an unbiased estimate of the test error
Out of bag error: Formalization
• For each (x_i, y_i) ∈ D, let us define T_i as the set of all training sets D_k which do not contain (x_i, y_i):
  T_i = { k | (x_i, y_i) ∉ D_k }
• Let the averaged classifier over all these datasets be:
  h̃_i(x) = (1/|T_i|) Σ_{k ∈ T_i} h_{D_k}(x)
• The out-of-bag error becomes:
  ε_OOB = (1/n) Σ_{(x_i, y_i) ∈ D} ℓ(h̃_i(x_i), y_i)
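A sketch of ε_OOB using index-based bootstrap samples and a hypothetical 1-nearest-neighbour base learner (all names are ours):

```python
import random
from collections import Counter

rng = random.Random(1)
data = [(x, "Good" if x > 175 else "Evil") for x in range(140, 200, 5)]
n, m = len(data), 30

# Keep bootstrap samples as index lists so we know which points each missed.
samples = [[rng.randrange(n) for _ in range(n)] for _ in range(m)]

def train_nn(idxs):
    """Hypothetical base learner: 1-nearest-neighbour on the bootstrap sample."""
    sample = [data[i] for i in idxs]
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

classifiers = [train_nn(idxs) for idxs in samples]

errors, counted = 0, 0
for i, (x, y) in enumerate(data):
    # Only classifiers whose bootstrap set does not contain (x_i, y_i) vote.
    votes = [classifiers[k](x) for k in range(m) if i not in samples[k]]
    if votes:
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y:
            errors += 1
print(errors / counted)  # the out-of-bag error estimate
```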
Random Forests
• Ensemble method specifically designed for decision tree classifiers
• It introduces two sources of randomness:
1. Bagging
2. Random input vectors
• Bagging: Each tree is grown using a bootstrap sample of the training data
• Random vector method: At each node, best split is chosen from a random sample of m attributes instead of all of them
• Alleviates correlation among bootstrap samples
Algorithm
1. Sample m datasets D₁, …, D_m from D with replacement
2. For each D_i, train a full decision tree h_i (max_depth = ∞) with a small modification:
   • Before each split, randomly select k < d features (without replacement)
   • Consider only these features for the split (increases the variance of the trees)
3. Final classifier:
   h(x) = (1/m) Σ_{i=1}^{m} h_i(x)
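A toy version of the algorithm, with depth-1 trees standing in for full trees to keep the sketch short (the data encoding and all names are our own):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Binary encoding of the slide's rows: (Mask, Cape, Tie, Ears, Smokes, Class).
data = [
    (1, 1, 0, 1, 0, "Good"), (1, 1, 0, 1, 0, "Good"), (0, 0, 1, 0, 0, "Good"),
    (0, 0, 1, 0, 1, "Evil"), (1, 0, 0, 1, 0, "Evil"), (0, 0, 0, 0, 0, "Evil"),
]
d, k, m = 5, 2, 15  # d features in total, k considered per split, m trees
rng = random.Random(0)

def train_stump(sample):
    """One depth-1 'tree': best split among k randomly chosen features."""
    feats = rng.sample(range(d), k)  # k < d, drawn without replacement

    def score(j):
        parts = [[r[-1] for r in sample if r[j] == v] for v in (0, 1)]
        return sum(len(p) / len(sample) * entropy(p) for p in parts if p)

    j = min(feats, key=score)
    leaves = {v: Counter(r[-1] for r in sample if r[j] == v) for v in (0, 1)}
    default = Counter(r[-1] for r in sample).most_common(1)[0][0]
    return lambda x: (leaves[x[j]].most_common(1)[0][0]
                      if leaves[x[j]] else default)

# Each stump is grown on a bootstrap sample; predictions are majority-voted.
forest = [train_stump([rng.choice(data) for _ in data]) for _ in range(m)]

def predict(x):
    return Counter(h(x) for h in forest).most_common(1)[0][0]

print(predict((1, 1, 0, 1, 0)))  # a Batman-like row
```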
[Figure] Source: Hastie et al., ESL
Tips
• One of the best, most popular, and easiest-to-use out-of-the-box classifiers. Two reasons for this:
  1. The RF has only two hyper-parameters, m and k, and it is quite insensitive to both. A good choice is:
     • For k: k = √d
     • For m: as large as possible
  2. Decision trees do not require a lot of preprocessing.
     • The features can be of different scale, magnitude, or slope.
     • Advantageous in scenarios with heterogeneous data, recorded in completely different units
Boosting
• Another ensemble-of-classifiers technique
• Scenario: a hypothesis class ℍ whose classifiers have a large bias and a high training error
• Question: Can weak learners h be combined to generate a strong learner? (Michael Kearns, Prof. UPenn, in his ML class project, 1988)
• Answer: Yes – Robert Schapire in 1990
• Weak learner: One whose error rate is only slightly better than random guessing.
• Strong learner: One who is arbitrarily well-correlated with the true classification
Boosting: How to?
• Create an ensemble classifier H(x) = Σ_{t=1}^{T} α_t h_t(x) in an iterative fashion
• At each iteration t, the classifier α_t h_t(x) is added to the ensemble
• At test time, all classifiers are evaluated and return the weighted sum
• The process is similar to gradient descent: instead of updating model parameters at each iteration, functions are added to the ensemble
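A sketch of the additive structure only: the α_t and h_t below are hand-picked for illustration, whereas AdaBoost would fit each one from the weighted training error (all names are ours):

```python
def predict(ensemble, x):
    """H(x) = sum_t alpha_t * h_t(x); the sign gives the class in {-1, +1}."""
    return sum(alpha * h(x) for alpha, h in ensemble)

# Hand-picked weighted stumps on a 1-D toy problem (labels in {-1, +1}).
stumps = [
    (0.9, lambda x: 1 if x > 2 else -1),
    (0.4, lambda x: 1 if x > 5 else -1),
    (0.2, lambda x: 1 if x <= 8 else -1),
]

ensemble = []
for alpha, h in stumps:  # one weighted function added per iteration
    ensemble.append((alpha, h))

print(1 if predict(ensemble, 6.0) > 0 else -1)  # → 1
```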
Schematic view of AdaBoost
Source: Hastie et al, ESL
err < 0.5
Gradient boosting trees
Final note: Boosting, Kaggle competitions & reproducibility
[Figure] Source: Hastie et al., ESL
MARS: Multivariate Adaptive Regression Splines
Recap
• We have covered a large set of supervised learning methods
• In this last lecture:
• Trees
• Ensemble methods
• Some of the best out of the box methods
• We have identified some of the problems these methods can present
• And presented ways to address them
Further reading and useful material
Source                            Notes
Elements of Statistical Learning  Ch. 9, 10, 15, 16