Decision Trees: Discussion
Machine Learning, Fall 2017

Supervised Learning: The Setup
Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar
This lecture: Learning Decision Trees
1. Representation: What are decision trees?
2. Algorithm: Learning decision trees – the ID3 algorithm, a greedy heuristic
3. Some extensions
Tips and Tricks
1. Decision tree variants
2. Handling examples with missing feature values
3. Non-Boolean features
4. Avoiding overfitting
1. Variants of information gain

Information gain is defined using entropy to measure the purity of the labels in the split datasets.

There are other ways to measure purity. Example: the MajorityError, which computes: “Suppose the tree were not grown below this node and the majority label were chosen; what would the error be?”

Suppose at some node there are 15 + and 5 - examples. What is the MajorityError? Answer: 5/20 = 1/4.

The MajorityError works like entropy: it can be used in place of entropy when computing the gain of a split.
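The two purity measures above can be sketched in a few lines of Python (a minimal illustration; the function names `entropy` and `majority_error` are my own, not from the lecture):

```python
import math

def entropy(pos, neg):
    """Entropy of a binary label distribution, in bits."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # 0 * log(0) is taken to be 0
            p = count / total
            e -= p * math.log2(p)
    return e

def majority_error(pos, neg):
    """Error if the tree stops here and predicts the majority label."""
    return min(pos, neg) / (pos + neg)

# The slide's example: 15 positive and 5 negative examples at a node.
print(majority_error(15, 5))  # 0.25
print(entropy(15, 5))
```

Both measures are 0 for a pure node and maximal for a 50/50 split, which is why MajorityError can stand in for entropy in the gain computation.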
2. Missing feature values

Suppose an example is missing the value of an attribute. What can we do at training time?

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
1    Sunny    Hot          High      Weak    No
2    Sunny    Hot          High      Strong  No
8    Sunny    Mild         ???       Weak    No
9    Sunny    Cool         High      Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes

“Complete the example” by:
– Using the most common value of the attribute in the data
– Using the most common value of the attribute among all examples with the same output
– Using fractional counts of the attribute values
  • E.g.: Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
  • Exercise: Will this change probability computations?

At test time? Use the same method.
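The first two completion strategies, plus fractional counts, can be sketched on the slide's Sunny subset (a hypothetical mini-dataset I wrote for illustration; `None` marks the missing Humidity value on day 8):

```python
from collections import Counter

# Hypothetical mini-dataset mirroring the slide's Sunny subset.
data = [
    {"Humidity": "High",   "PlayTennis": "No"},
    {"Humidity": "High",   "PlayTennis": "No"},
    {"Humidity": None,     "PlayTennis": "No"},   # day 8: missing value
    {"Humidity": "High",   "PlayTennis": "Yes"},
    {"Humidity": "Normal", "PlayTennis": "Yes"},
]

# Strategy 1: most common value of the attribute overall.
overall = Counter(r["Humidity"] for r in data if r["Humidity"] is not None)
fill_overall = overall.most_common(1)[0][0]

# Strategy 2: most common value among examples with the same output.
same_label = Counter(r["Humidity"] for r in data
                     if r["Humidity"] is not None and r["PlayTennis"] == "No")
fill_same_label = same_label.most_common(1)[0][0]

# Strategy 3: fractional counts of the attribute values.
total = sum(overall.values())
fractions = {v: c / total for v, c in overall.items()}

print(fill_overall, fill_same_label, fractions)
```

On this subset both hard strategies fill in "High", while fractional counts give {3/4 High, 1/4 Normal}; fractional counts do change downstream probability computations, since one example now contributes partial counts to several branches.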
3. Non-Boolean features

• If the features can take multiple values:
– We have seen one edge per value (i.e., a multi-way split), e.g. an Outlook node with branches Sunny, Overcast, and Rain
– Another option: make the attributes Boolean by testing for each value. Convert Outlook=Sunny into {Outlook:Sunny=True, Outlook:Overcast=False, Outlook:Rain=False}
– Or, perhaps, group values into disjoint sets
• For numeric features, use thresholds or ranges to get Boolean/discrete alternatives
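Both conversions above are mechanical; here is a minimal sketch (the helper names `booleanize` and `threshold_feature` are mine, not from the lecture):

```python
def booleanize(attr, value, possible_values):
    """Turn one multi-valued attribute into one Boolean feature per value."""
    return {f"{attr}:{v}": (value == v) for v in possible_values}

outlooks = ["Sunny", "Overcast", "Rain"]
print(booleanize("Outlook", "Sunny", outlooks))
# {'Outlook:Sunny': True, 'Outlook:Overcast': False, 'Outlook:Rain': False}

def threshold_feature(x, t):
    """A numeric feature becomes Boolean via a threshold test x <= t."""
    return x <= t

print(threshold_feature(71.5, 75))  # True
```

With the Boolean encoding, a multi-way split on Outlook becomes a sequence of binary splits, one per value.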
4. Overfitting
The “First Bit” function

• A Boolean function with n inputs
• Simply returns the value of the first input; all others are irrelevant

X0  X1  Y
F   F   F
F   T   F
T   F   T
T   T   T

X1 is irrelevant: Y = X0.

What is the decision tree for this function? A single test on X0:

X0
– T: predict T
– F: predict F

Exercise: Convince yourself that ID3 will generate this tree.
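One way to do the exercise is to run a toy ID3 on the truth table (an illustrative sketch of the algorithm, not the course's reference implementation): X0 has information gain 1 bit, X1 has gain 0, so ID3 splits on X0 and both children are pure.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting."""
    gain = entropy(labels)
    for value in set(x[feature] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def id3(examples, labels, features):
    if len(set(labels)) == 1:                 # pure node: make a leaf
        return labels[0]
    if not features:                          # no tests left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, labels, f))
    branches = {}
    for value in set(x[best] for x in examples):
        sub = [(x, y) for x, y in zip(examples, labels) if x[best] == value]
        branches[value] = id3([x for x, _ in sub], [y for _, y in sub],
                              [f for f in features if f != best])
    return (best, branches)

# All four examples of the two-input first-bit function: Y = X0.
X = [{0: a, 1: b} for a in (False, True) for b in (False, True)]
Y = [x[0] for x in X]
tree = id3(X, Y, [0, 1])
print(tree)  # (0, {False: False, True: True}): split on X0, two pure leaves
```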
The best case scenario: Perfect data

Suppose we have all 2^n examples for training. What will the error be on any future examples?

Zero! Because we have seen every possible input! The decision tree can represent the function, and ID3 will build a consistent tree.
Noisy data

What if the data is noisy? And we have all 2^n examples.

X0  X1  X2  Y
F   F   F   F
F   F   T   F
F   T   F   F
F   T   T   F
T   F   F   T
T   F   T   T
T   T   F   T
T   T   T   T

Suppose the outputs of both the training and test sets are randomly corrupted, e.g. each output is corrupted with probability 0.25. The train and test sets are no longer identical: both have noise, possibly different.
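This setup is easy to simulate (a Monte Carlo sketch I wrote, not the experiment behind the lecture's plot): a tree that memorizes the corrupted training label for every input is scored against an independently corrupted test label.

```python
import random

random.seed(0)
p = 0.25            # corruption probability from the slide
n_trials = 20000

correct = 0
for _ in range(n_trials):
    true_label = random.random() < 0.5
    # The tree memorizes the (possibly flipped) training label...
    train_label = true_label ^ (random.random() < p)
    # ...and is scored against an independently flipped test label.
    test_label = true_label ^ (random.random() < p)
    correct += (train_label == test_label)

accuracy = correct / n_trials
print(accuracy)     # close to 0.625 = 0.75^2 + 0.25^2
```

The simulated accuracy settles near 0.625 (error near 0.375), matching the analytic computation below the plot.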
(Figure: test accuracy for different input sizes, plotted against the number of features, 2 through 15. The error bars are generated by running the same experiment multiple times for the same setting.)

The data is noisy, we have all 2^n examples, and each output is corrupted with probability 0.25.
The measured test error is approximately 0.375.
We can analytically compute the test error in this case.

Correct prediction:
P(training example uncorrupted AND test example uncorrupted) = 0.75 × 0.75 = 0.5625
P(training example corrupted AND test example corrupted) = 0.25 × 0.25 = 0.0625
P(correct prediction) = 0.5625 + 0.0625 = 0.625

Incorrect prediction:
P(training example uncorrupted AND test example corrupted) = 0.75 × 0.25 = 0.1875
P(training example corrupted AND test example uncorrupted) = 0.25 × 0.75 = 0.1875
P(incorrect prediction) = 0.1875 + 0.1875 = 0.375
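The arithmetic above reduces to two lines (a quick check, nothing more):

```python
p = 0.25                                   # corruption probability
p_correct = (1 - p) * (1 - p) + p * p      # both labels clean, or both flipped
p_wrong = (1 - p) * p + p * (1 - p)        # exactly one of the two flipped
print(p_correct, p_wrong)                  # 0.625 0.375
```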
What about the training accuracy?

Training accuracy = 100%, because the learning algorithm will find a tree that agrees with the (noisy) training data.
Then why is the classifier not perfect? The classifier overfits the training data.
Overfitting

• The learning algorithm fits the noise in the data: irrelevant attributes or noisy examples influence the choice of the hypothesis
• This may lead to poor performance on future examples
Overfitting: One definition

• Data comes from a probability distribution D(X, Y)
• We are using a hypothesis space H
• Errors:
– Training error for hypothesis h ∈ H: error_train(h)
– True error for h ∈ H: error_D(h)
• A hypothesis h overfits the training data if there is another hypothesis h′ such that:
1. error_train(h) < error_train(h′)
2. error_D(h) > error_D(h′)
Decision trees will overfit
(Graph from Mitchell.)
Avoiding overfitting with decision trees

Some approaches:
1. Fix the depth of the tree
• A decision stump = a decision tree with only one level
• Typically will not be very good by itself
• But short decision trees can make very good features for a second layer of learning
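A decision stump is just ID3 cut off after one split: pick the single best attribute, then predict the majority label in each branch. A minimal sketch (function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_stump(examples, labels, features):
    """One-level tree: best attribute by information gain, majority leaves."""
    def gain(f):
        g = entropy(labels)
        for v in set(x[f] for x in examples):
            sub = [y for x, y in zip(examples, labels) if x[f] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g
    best = max(features, key=gain)
    leaves = {}
    for v in set(x[best] for x in examples):
        sub = [y for x, y in zip(examples, labels) if x[best] == v]
        leaves[v] = Counter(sub).most_common(1)[0][0]  # majority label
    return best, leaves

# Toy data loosely modeled on the PlayTennis example.
X = [{"Outlook": o} for o in ("Sunny", "Sunny", "Overcast", "Rain", "Rain")]
Y = ["No", "No", "Yes", "Yes", "No"]
attr, leaves = best_stump(X, Y, ["Outlook"])
print(attr, leaves)
```

Stumps of this kind are the classic weak learners fed to a second layer such as boosting.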
2. Optimize on a held-out set (also called a development set or validation set) while growing the tree
• Split your data into two parts: a training set and a held-out set
• Grow your tree on the training split and check performance on the held-out set after every new node is added
• If growing the tree hurts validation-set performance, stop growing
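The growth loop above can be sketched abstractly; in this sketch `candidate_splits` and `accuracy_on_holdout` are hypothetical hooks standing in for the real tree grower and evaluator:

```python
def grow_with_holdout(candidate_splits, accuracy_on_holdout):
    """Greedy growth with early stopping: take trees that each add one more
    node, and stop as soon as held-out accuracy drops.
    `candidate_splits` yields successively larger trees;
    `accuracy_on_holdout` scores a tree on the held-out split."""
    best_tree, best_acc = None, -1.0
    for tree in candidate_splits:
        acc = accuracy_on_holdout(tree)
        if acc < best_acc:        # growing hurt validation performance: stop
            break
        best_tree, best_acc = tree, acc
    return best_tree

# Toy illustration: "trees" are just ids, with a made-up accuracy curve.
accs = {1: 0.70, 2: 0.80, 3: 0.85, 4: 0.82}
chosen = grow_with_holdout([1, 2, 3, 4], lambda t: accs[t])
print(chosen)  # 3: the last tree before validation accuracy dropped
```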
3. Grow the full tree, then prune it as a post-processing step, in one of several ways:
• Use a validation set for pruning, from the bottom up, greedily
• Convert the tree into a set of rules (one rule per path from root to leaf) and prune each rule independently
Reminder: Decision trees are rules

• If Color=Blue and Shape=triangle, then output = B
• If Color=Red, then output = B
• …
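The tree-to-rules conversion used for rule pruning is a walk over root-to-leaf paths. A minimal sketch, assuming a hypothetical nested-tuple tree representation of my own:

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one if-then rule.
    Internal nodes are (attribute, {value: subtree}) pairs; anything else
    is a leaf label. (This representation is an illustrative assumption.)"""
    is_node = isinstance(tree, tuple) and len(tree) == 2 and isinstance(tree[1], dict)
    if not is_node:  # leaf: emit the accumulated path as a rule
        cond = " and ".join(f"{a}={v}" for a, v in conditions) or "always"
        return [f"If {cond}, then output = {tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

# A toy tree matching the slide's two example rules.
tree = ("Color", {"Blue": ("Shape", {"triangle": "B", "circle": "A"}),
                  "Red": "B"})
rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Once flattened like this, each rule's conditions can be pruned independently, which is more flexible than pruning whole subtrees.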
Summary: Decision trees

• A popular machine learning tool
– Prediction is easy
– Can represent any Boolean function
• Greedy heuristics for learning
– The ID3 algorithm (using information gain)
• Can be used for regression too
• Decision trees are prone to overfitting unless you take care to avoid it