[Slide 1]
CSE 446: Decision Trees (Winter 2012)
Slides adapted from Carlos Guestrin and Andrew Moore by Luke Zettlemoyer & Dan Weld
[Slide 2]
Overview of Learning

What is being learned (rows) vs. type of supervision, e.g., experience or feedback (columns):

| What is Being Learned? | Labeled Examples | Reward | Nothing |
| --- | --- | --- | --- |
| Discrete function | Classification | | Clustering |
| Continuous function | Regression | | |
| Policy | Apprenticeship Learning | Reinforcement Learning | |
[Slide 3]
A learning problem: predict fuel efficiency
From the UCI repository (thanks to Ross Quinlan)
• 40 Records
• Discrete data (for now)
• Predict MPG
| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | maker |
| --- | --- | --- | --- | --- | --- | --- | --- |
| good | 4 | low | low | low | high | 75to78 | asia |
| bad | 6 | medium | medium | medium | medium | 70to74 | america |
| bad | 4 | medium | medium | medium | low | 75to78 | europe |
| bad | 8 | high | high | high | low | 70to74 | america |
| bad | 6 | medium | medium | medium | medium | 70to74 | america |
| bad | 4 | low | medium | low | medium | 70to74 | asia |
| bad | 4 | low | medium | low | low | 70to74 | asia |
| bad | 8 | high | high | high | low | 75to78 | america |
| : | : | : | : | : | : | : | : |
| bad | 8 | high | high | high | low | 70to74 | america |
| good | 8 | high | medium | high | high | 79to83 | america |
| bad | 8 | high | high | high | low | 75to78 | america |
| good | 4 | low | low | low | low | 79to83 | america |
| bad | 6 | medium | medium | medium | high | 75to78 | america |
| good | 4 | medium | low | low | low | 79to83 | america |
| good | 4 | low | low | medium | high | 79to83 | america |
| bad | 8 | high | high | high | low | 70to74 | america |
| good | 4 | low | medium | low | medium | 75to78 | europe |
| bad | 5 | medium | medium | medium | medium | 75to78 | europe |
[Slide 4]
Hypotheses: decision trees f : X → Y
• Each internal node tests an attribute xi
• Each branch assigns an attribute value xi=v
• Each leaf assigns a class y
• To classify an input x: traverse the tree from root to leaf and output the leaf's label y
Cylinders?
  3 → good
  4 → Maker?
        america → bad
        asia → good
        europe → good
  5 → bad
  6 → bad
  8 → Horsepower?
        low → bad
        med → good
        high → bad
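To make the traversal concrete, here is a small illustrative Python sketch of this tree (my own code, not from the slides; the attribute names and leaf labels follow the reconstructed figure above):

```python
class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute       # attribute tested at an internal node
        self.children = children or {}   # attribute value -> subtree
        self.label = label               # class at a leaf ("good"/"bad")

def classify(node, x):
    """Traverse from root to leaf, following x's attribute values."""
    while node.label is None:
        node = node.children[x[node.attribute]]
    return node.label

tree = Node("cylinders", {
    "3": Node(label="good"),
    "4": Node("maker", {"america": Node(label="bad"),
                        "asia":    Node(label="good"),
                        "europe":  Node(label="good")}),
    "5": Node(label="bad"),
    "6": Node(label="bad"),
    "8": Node("horsepower", {"low":  Node(label="bad"),
                             "med":  Node(label="good"),
                             "high": Node(label="bad")}),
})

print(classify(tree, {"cylinders": "4", "maker": "asia"}))  # -> good
```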
[Slide 5]
Hypothesis space
• How many possible hypotheses?
• What functions can be represented?
• How many will be consistent with a given dataset?
• How will we choose the best one?
[Slide 6]
Are all decision trees equal?
• Many trees can represent the same concept
• But not all trees will have the same size!

e.g., y = (A ∧ B) ∨ (A ∧ C)

A compact tree tests A first:

A?
  t → B?
        t → +
        f → C?  (t → +, f → −)
  f → −

An equivalent but larger tree tests B and C before A:

B?
  t → C?
        t → A?  (t → +, f → −)
        f → A?  (t → +, f → −)
  f → C?
        t → A?  (t → +, f → −)
        f → −

• Which tree do we prefer?
[Slide 7]
Learning decision trees is hard!!!
• Finding the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
• Resort to a greedy heuristic:
  – Start from an empty decision tree
  – Split on the next best attribute (feature)
  – Recurse
[Slide 8]
Improving Our Tree
[Figure: the tree so far; one leaf predicts mpg = bad.]
[Slide 9]
Recursive Step
Partition the records by the value of the split attribute and build a subtree from each partition:
• Records in which cylinders = 4 → build tree from these records
• Records in which cylinders = 5 → build tree from these records
• Records in which cylinders = 6 → build tree from these records
• Records in which cylinders = 8 → build tree from these records
[Slide 10]
A full tree
[Slide 11]
Two Questions
1. Which attribute gives the best split?
2. When to stop recursion?

Greedy Algorithm:
– Start from an empty decision tree
– Split on the best attribute (feature)
– Recurse
[Slide 12]
Which attribute gives the best split?
• A1: The one with the highest information gain, defined in terms of entropy.
• A2: Actually, there are many alternatives, e.g., accuracy, which seeks to reduce the misclassification rate.
[Slide 13]
Entropy
Entropy H(Y) of a random variable Y:

H(Y) = − Σ_i P(Y = y_i) log2 P(Y = y_i)

More uncertainty, more entropy!
Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
[Slide 14]
Entropy Example

| X1 | X2 | Y |
| --- | --- | --- |
| T | T | T |
| T | F | T |
| T | T | T |
| T | F | T |
| F | T | T |
| F | F | F |

P(Y = t) = 5/6, P(Y = f) = 1/6
H(Y) = − 5/6 log2 5/6 − 1/6 log2 1/6 ≈ 0.65
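As a quick check, a small illustrative Python sketch (my own code, not from the slides) that computes this entropy from a list of labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_i P(Y=y_i) log2 P(Y=y_i), estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

y = ["T", "T", "T", "T", "T", "F"]   # the Y column above
print(round(entropy(y), 2))          # 0.65
```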
[Slide 15]
Conditional Entropy
Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:

H(Y|X) = − Σ_j P(X = x_j) Σ_i P(Y = y_i | X = x_j) log2 P(Y = y_i | X = x_j)

Example (same table as above), splitting on X1:
– X1 = t: Y = t : 4, Y = f : 0, so P(X1 = t) = 4/6
– X1 = f: Y = t : 1, Y = f : 1, so P(X1 = f) = 2/6

H(Y|X1) = − 4/6 (1 log2 1 + 0 log2 0) − 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
        = 2/6 ≈ 0.33
(using the convention 0 log2 0 = 0)
[Slide 16]
Information Gain
Advantage of an attribute: the decrease in entropy (uncertainty) after splitting.

IG(X) = H(Y) − H(Y|X)

In our running example (same table as above):
IG(X1) = H(Y) − H(Y|X1) = 0.65 − 0.33 = 0.32
IG(X1) > 0 ⇒ we prefer the split!
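Continuing the sketch above (again my own illustrative code, reusing entropy), conditional entropy and information gain for the X1 split:

```python
def conditional_entropy(xs, ys):
    """H(Y|X): weight the entropy of Y within each X-value group by P(X=x)."""
    n = len(ys)
    h = 0.0
    for v in set(xs):
        group = [y for x, y in zip(xs, ys) if x == v]
        h += (len(group) / n) * entropy(group)
    return h

x1 = ["T", "T", "T", "T", "F", "F"]
y  = ["T", "T", "T", "T", "T", "F"]
ig = entropy(y) - conditional_entropy(x1, y)
print(round(ig, 2))  # ~0.32
```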
[Slide 17]
Alternate Splitting Criteria
• Misclassification impurity: the minimum probability that a training pattern will be misclassified:
  M(Y) = 1 − max_i P(Y = y_i)
• Misclassification gain:
  IG_M(X) = [1 − max_i P(Y = y_i)] − [1 − max_j max_i P(Y = y_i | X = x_j)]
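These two quantities are short enough to write out directly; a small illustrative sketch (my code, not the slides', reusing Counter from the entropy snippet):

```python
def misclassification_impurity(ys):
    """M(Y) = 1 - max_i P(Y = y_i): error rate of always predicting the majority."""
    return 1 - max(Counter(ys).values()) / len(ys)

def misclassification_gain(xs, ys):
    """IG_M(X) as on the slide: impurity before vs. the purest single X-group."""
    best_group_purity = max(
        max(Counter(y for x, y in zip(xs, ys) if x == v).values()) /
        sum(1 for x in xs if x == v)
        for v in set(xs))
    return misclassification_impurity(ys) - (1 - best_group_purity)
```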
[Slide 18]
Learning Decision Trees
• Start from an empty decision tree
• Split on the next best attribute (feature)
  – Use information gain (or…?) to select the attribute
• Recurse
[Slide 19]
Suppose we want to predict MPG
Now, look at all the information gains…
[Figure: the information gain of each attribute for predicting mpg; the tree so far is a single leaf, predict mpg = bad.]
[Slide 20]
Tree After One Iteration
[Slide 21]
When to Terminate?
[Slide 22]
Base Case One
Don’t split a node if all matching records have the same output value.
[Slide 23]
Base Case Two
Don’t split a node if none of the attributes can create multiple [non-empty] children.
[Slide 24]
Base Case Two: no attributes can distinguish the records.
[Slide 25]
Base Cases: An Idea
• Base Case One: if all records in the current data subset have the same output, then don’t recurse.
• Base Case Two: if all records have exactly the same set of input attributes, then don’t recurse.

Proposed Base Case 3: if all attributes have zero information gain, then don’t recurse.

Is this a good idea?
[Slide 26]
The problem with Base Case 3

y = a XOR b

| a | b | y |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The information gains: IG(a) = IG(b) = 0, so Base Case 3 refuses to split at the root. The resulting decision tree: a single leaf that predicts a majority class and misclassifies half the examples.
[Slide 27]
But Without Base Case 3:
For the same y = a XOR b data, the resulting decision tree splits on a and then on b, classifying every example correctly.

So: Base Case 3? Include or omit?
[Slide 28]
Building Decision Trees

BuildTree(DataSet, Output):
  If all output values are the same in DataSet,
    return a leaf node that says “predict this unique output”.
  If all input values are the same,
    return a leaf node that says “predict the majority output”.
  Else:
    find the attribute X with the highest information gain;
    suppose X has n_X distinct values (i.e., X has arity n_X);
    create and return a non-leaf node with n_X children,
    where the i-th child is built by calling BuildTree(DS_i, Output),
    and DS_i consists of all those records in DataSet for which
    X = the i-th distinct value of X.
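A minimal runnable version of this pseudocode (my own illustrative Python; it reuses Node, entropy, conditional_entropy, and Counter from the earlier sketches):

```python
def majority(ys):
    """Most common output value."""
    return Counter(ys).most_common(1)[0][0]

def build_tree(records, ys, attributes):
    # Base Case One: all outputs are the same -> leaf with that output
    if len(set(ys)) == 1:
        return Node(label=ys[0])
    # Base Case Two: no attribute can split the records -> majority leaf
    if not attributes or all(len({r[a] for r in records}) == 1 for a in attributes):
        return Node(label=majority(ys))
    # Split on the attribute with the highest information gain
    best = max(attributes,
               key=lambda a: entropy(ys) - conditional_entropy([r[a] for r in records], ys))
    children = {}
    for v in {r[best] for r in records}:
        idx = [i for i, r in enumerate(records) if r[best] == v]
        children[v] = build_tree([records[i] for i in idx],
                                 [ys[i] for i in idx],
                                 [a for a in attributes if a != best])
    return Node(best, children)

# e.g., records = [{"cylinders": "4", "maker": "asia", ...}, ...], ys = ["good", ...]
```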
[Slide 29]
General View of a Classifier

[Figure: a 2-D feature space (axes roughly 0.0–6.0 by 0.0–3.0) scattered with + and − training points. The hypothesis is a decision boundary for the labeling function: it carves the space into a region labeled + and a region labeled −, and query points (marked ?) are labeled according to the region they fall in.]
[Slide 30]
[Slide 31]
Ok, so how does it perform?
[Slide 32]
MPG Test set error
[Slide 33]
MPG test set error
The test set error is much worse than the training set error…
…why?
[Slide 34]
Decision trees will overfit
• Our decision trees have no learning bias:
  – Training set error is always zero! (if there is no label noise)
  – Lots of variance
  – Will definitely overfit!!!
  – Must introduce some bias towards simpler trees
• Why might one pick simpler trees?
[Slide 35]
Occam’s Razor
• Why favor short hypotheses?
• Arguments for:
  – There are fewer short hypotheses than long ones
    → a short hypothesis is less likely to fit the data by coincidence
    → a long hypothesis that fits the data might be a coincidence
• Arguments against:
  – The argument above really uses only the fact that the hypothesis space is small!
  – What is so special about small sets based on the complexity of each hypothesis?
[Slide 36]
How to Build Small Trees
Several reasonable approaches:
• Stop growing the tree before it overfits:
  – Bound the depth or number of leaves
  – Base Case 3
  – Doesn’t work well in practice
• Grow the full tree, then prune:
  – Optimize on a held-out (development) set
    • If growing the tree hurts performance, then cut back
    • Con: requires a larger amount of data…
  – Use statistical significance testing
    • Test whether the improvement from any split is likely due to noise
    • If so, prune the split!
  – Convert the tree to logical rules, then simplify the rules
[Slide 37]
Reduced-Error Pruning
Split the data into training & validation sets (validation: 10–33%).
Train on the training set (allowing overfitting).
Do until further pruning is harmful:
1) Evaluate the effect on the validation set of pruning each possible node (and the subtree below it)
2) Greedily remove the node whose removal most improves validation-set accuracy
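The loop is simple to sketch. Below is my own illustrative Python, not the slides' code; it reuses the Node class and majority helper from earlier and assumes hypothetical helpers accuracy(tree, records, ys) and labels_reaching(node, records, ys):

```python
def internal_nodes(tree):
    """Yield every internal (non-leaf) node."""
    if tree.label is None:
        yield tree
        for child in tree.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, val_records, val_ys, records, ys):
    while True:
        base = accuracy(tree, val_records, val_ys)        # hypothetical helper
        best_gain, best_node, best_ys = 0.0, None, None
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            ys_here = labels_reaching(node, records, ys)  # hypothetical helper
            # Tentatively collapse this node into a majority-vote leaf
            node.attribute, node.children, node.label = None, {}, majority(ys_here)
            gain = accuracy(tree, val_records, val_ys) - base
            node.attribute, node.children, node.label = saved  # undo
            if gain > best_gain:
                best_gain, best_node, best_ys = gain, node, ys_here
        if best_node is None:
            return tree   # no pruning improves validation accuracy: stop
        best_node.attribute, best_node.children = None, {}
        best_node.label = majority(best_ys)
```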
[Slide 38]
Effect of Reduced-Error Pruning
[Figure: accuracy as a function of tree size for the training and validation data; validation accuracy peaks at an intermediate size, annotated “cut tree back to here”.]
[Slide 39]
Alternatively
• Chi-squared pruning:
  – Grow the tree fully
  – Consider the leaves in turn: is the parent split worth it?
• How does this compare to Base Case 3?
[Slide 40]
Consider this split
[Slide 41]
A chi-square test
• Suppose that mpg was completely uncorrelated with maker.
• What is the chance we’d have seen data with at least this apparent level of association anyway?
Using a particular kind of chi-square test, the answer is 13.5%.
Such hypothesis tests are relatively easy to compute, though the details are involved.
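For concreteness, here is how such a test might be run in Python with scipy; this is my own example, and the maker-by-mpg counts below are made up, not the slide's actual table:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = maker (america, asia, europe),
# columns = mpg (bad, good). Counts are illustrative only.
table = [[10, 2],
         [3, 5],
         [2, 2]]

chi2, pchance, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, pchance = {pchance:.3f}")
# A large pchance means the apparent association could easily be noise.
```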
[Slide 42]
Using chi-squared to avoid overfitting
• Build the full decision tree as before.
• But when you can grow it no more, start to prune:
  – Beginning at the bottom of the tree, delete splits for which pchance > MaxPchance
  – Continue working your way up until there are no more prunable nodes

MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise.
[Slide 43]
Regularization
• Note for the future: MaxPchance is a regularization parameter that helps us bias towards simpler models.

[Figure: expected test-set error as a function of MaxPchance. Decreasing MaxPchance gives smaller trees, increasing it gives larger trees, and the expected test-set error is lowest at an intermediate value.]

We’ll learn how to choose the value of magic parameters like this one later!
[Slide 44]
Missing Data
(The MPG training table from earlier is shown again.)
What to do??
[Slide 45]
Missing Data 1

| Day | Temp | Humid | Wind | Tennis? |
| --- | --- | --- | --- | --- |
| d1 | h | h | weak | n |
| d2 | h | h | s | n |
| d8 | m | h | weak | n |
| d9 | c | ? | weak | yes |
| d11 | m | n | s | yes |

• Don’t use this instance for learning?
• Or assign the attribute…
  – the most common value for the attribute, or
  – the most common value given the classification
[Slide 46]
Attributes with Many Values?
[Slide 47]
Attributes with many values
• An attribute can have so many values that it:
  – Divides the examples into tiny sets
  – Each set is likely uniform → high information gain
  – But it is a poor predictor…
• Need to penalize these attributes (one standard penalty is sketched below)
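One common penalty (C4.5's "gain ratio", a name the slide doesn't use) divides the information gain by the entropy of the attribute itself, which is large for many-valued attributes. An illustrative sketch reusing the helpers above:

```python
def gain_ratio(xs, ys):
    """IG(X) / SplitInfo(X), where SplitInfo(X) is the entropy of X itself."""
    ig = entropy(ys) - conditional_entropy(xs, ys)
    split_info = entropy(xs)   # large when X has many evenly used values
    return ig / split_info if split_info > 0 else 0.0
```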
[Slide 48]
Real-Valued Inputs
What should we do if some of the inputs are real-valued?
| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | maker |
| --- | --- | --- | --- | --- | --- | --- | --- |
| good | 4 | 97 | 75 | 2265 | 18.2 | 77 | asia |
| bad | 6 | 199 | 90 | 2648 | 15 | 70 | america |
| bad | 4 | 121 | 110 | 2600 | 12.8 | 77 | europe |
| bad | 8 | 350 | 175 | 4100 | 13 | 73 | america |
| bad | 6 | 198 | 95 | 3102 | 16.5 | 74 | america |
| bad | 4 | 108 | 94 | 2379 | 16.5 | 73 | asia |
| bad | 4 | 113 | 95 | 2228 | 14 | 71 | asia |
| bad | 8 | 302 | 139 | 3570 | 12.8 | 78 | america |
| : | : | : | : | : | : | : | : |
| good | 4 | 120 | 79 | 2625 | 18.6 | 82 | america |
| bad | 8 | 455 | 225 | 4425 | 10 | 70 | america |
| good | 4 | 107 | 86 | 2464 | 15.5 | 76 | europe |
| bad | 5 | 131 | 103 | 2830 | 15.9 | 78 | europe |
Infinite number of possible split values!!!
Finite dataset, only finite number of relevant splits!
[Slide 49]
“One branch for each numeric value” idea:
Hopeless: with such a high branching factor, the tree will shatter the dataset and overfit.
[Slide 50]
Threshold splits
• Binary tree: split on attribute X at value t
  – One branch: X < t
  – Other branch: X ≥ t
[Slide 51]
The set of possible thresholds
• Binary tree, split on attribute X:
  – One branch: X < t
  – Other branch: X ≥ t
• Searching through all possible values of t seems hard!!!
• But only a finite number of t’s are important:
  – Sort the data according to X into {x1, …, xm}
  – Consider split points of the form x_i + (x_{i+1} − x_i)/2, i.e., the midpoints between consecutive sorted values
[Slide 52]
Picking the best threshold
• Suppose X is real-valued.
• Define IG(Y|X:t) = H(Y) − H(Y|X:t), where
  H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X ≥ t) P(X ≥ t)
• IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t.
• Then define IG*(Y|X) = max_t IG(Y|X:t).
• For each real-valued attribute, use IG*(Y|X) to assess its suitability as a split.
• Note: we may split on the same attribute multiple times, with different thresholds.
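A small illustrative sketch (my code, reusing entropy from earlier) that enumerates the midpoint thresholds and returns IG*(Y|X) together with the best t:

```python
def best_threshold(xs, ys):
    """Return (IG*(Y|X), t*) over midpoints of consecutive sorted x values."""
    pairs = sorted(zip(xs, ys))
    n, h_y = len(pairs), entropy(ys)
    best_ig, best_t = 0.0, None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        h_cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, t
    return best_ig, best_t

# e.g., a few weight values from the table above:
print(best_threshold([2265, 2648, 2600, 4100, 3102],
                     ["good", "bad", "bad", "bad", "bad"]))  # t* = 2432.5
```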
[Slide 53]
Example with MPG
[Slide 54]
Example tree for our continuous dataset
[Slide 55]
What you need to know about decision trees
• Decision trees are one of the most popular ML tools:
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Use information gain to select attributes (ID3, C4.5, …)
• Presented here for classification, but can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find “simple trees”, e.g.:
    • Fixed depth / early stopping
    • Pruning
    • Hypothesis testing
[Slide 56]
Acknowledgements
• Some of the material in the decision trees presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials
• Improved by Carlos Guestrin and Luke Zettlemoyer