Decision Tree Learning Mitchell, Chapter 3 - Washington State
Decision Tree Learning Mitchell, Chapter 3
CptS 570 Machine Learning
School of EECS
Washington State University
Outline
Decision tree representation
ID3 learning algorithm
Entropy and information gain
Overfitting
Enhancements
Decision Tree for PlayTennis
Decision Trees
Decision tree representation:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
How would we represent:
∧, ∨, XOR
(A ∧ B) ∨ (C ∧ ¬D ∧ E)
M of N
When to Consider Decision Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data

Examples:
Equipment or medical diagnosis
Credit risk analysis
Modeling calendar scheduling preferences
Top-Down Induction of Decision Trees
Main loop (ID3, Table 3.1):
A ← the "best" decision attribute for the next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If training examples perfectly classified, then STOP
Else iterate over new leaf nodes
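The main loop above maps naturally onto a recursive implementation. Below is a minimal sketch (not Mitchell's Table 3.1 verbatim): examples are (attribute-dict, label) pairs, and the "best" attribute is chosen by the information-gain measure defined on the following slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (defined on a later slide)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    """Expected reduction in entropy from splitting on attr."""
    total = entropy([y for _, y in examples])
    for v in {x[attr] for x, _ in examples}:
        sub = [y for x, y in examples if x[attr] == v]
        total -= len(sub) / len(examples) * entropy(sub)
    return total

def id3(examples, attrs):
    """examples: list of (attribute-dict, label) pairs.
    Returns a leaf label, or an (attribute, {value: subtree}) node."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # perfectly classified: STOP
        return labels[0]
    if not attrs:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))   # "best" attribute
    branches = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, x):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree
```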
Which Attribute is Best?
Entropy
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖
Entropy
Entropy(S) = expected number of bits needed to encode class (⊕ or ⊖) of randomly drawn member of S (under the optimal, shortest-length code)
Why? Information theory
Optimal length code assigns (– log2 p) bits to message having probability p
So, expected number of bits to encode ⊕ or ⊖ of random member of S:
p⊕ (– log2 p⊕) + p⊖ (– log2 p⊖)
Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖
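As a numeric check of this formula (a minimal sketch): the PlayTennis sample used later in the chapter has 9 positive and 5 negative examples, so Entropy([9+, 5−]) ≈ 0.940.

```python
import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

# S = [9+, 5-]: 9 positive and 5 negative of 14 examples
print(round(entropy(9/14, 5/14), 3))  # prints 0.94
```

A 50/50 split gives the maximum impurity of 1 bit, and a pure sample gives 0.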
Information Gain
Gain(S,A) = expected reduction in entropy due to sorting on attribute A
Gain(S, A) ≡ Entropy(S) – Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
Training Examples: PlayTennis

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
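Applying Gain(S, A) from the previous slide to this table (an illustrative sketch, not code from the slides) reproduces the chapter's attribute ranking, with Outlook the clear winner.

```python
import math

# The 14 PlayTennis examples from the table above, as
# (Outlook, Temperature, Humidity, Wind, PlayTennis) tuples.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, col):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)."""
    total = entropy([r[-1] for r in rows])
    for v in {r[col] for r in rows}:
        sub = [r[-1] for r in rows if r[col] == v]
        total -= len(sub) / len(rows) * entropy(sub)
    return total

gains = {a: gain(DATA, i) for a, i in ATTRS.items()}
# Outlook ~0.247 is the largest, so Outlook becomes the root;
# Humidity ~0.152, Wind ~0.048, Temperature ~0.029.
```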
Selecting the Next Attribute
Selecting the Next Attribute
Hypothesis Space Search by ID3
Hypothesis Space Search by ID3
Hypothesis space is complete!
Target function surely in there…

Outputs a single hypothesis (which one?)
Can't play 20 questions…

No backtracking
Local minima…

Statistically-based search choices
Robust to noisy data…

Inductive bias: "prefer shortest tree"
Inductive Bias in ID3
Note H is the power set of instances X

Unbiased?
Not really…
Preference for short trees with high information gain attributes near the root

Bias is a preference for some hypotheses, rather than a restriction on the hypothesis space H

Occam's razor: Prefer the shortest hypothesis that fits the data
Occam's Razor
Why prefer short hypotheses?

Argument in favor:
Fewer short hypotheses than long hypotheses
Short hypothesis that fits data is unlikely to be coincidence
Long hypothesis that fits data might be coincidence

Argument opposed:
There are many ways to define small sets of hypotheses
E.g., all trees with a prime number of nodes that use attributes beginning with "Z"
What's so special about small sets based on the size of the hypothesis?
Overfitting in Decision Trees
Consider adding noisy training example #15:(<Sunny, Hot, Normal, Strong>, PlayTennis = No)
What effect on earlier tree?
Overfitting
Consider error of hypothesis h over
Training data: error_train(h)
Entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')
and
error_D(h) > error_D(h')
Overfitting in Decision Tree Learning
Avoiding Overfitting
How can we avoid overfitting?
Stop growing when data split not statistically significant
Grow full tree, then post-prune

How to select "best" tree:
Measure performance over training data
Measure performance over separate validation data set
MDL: Minimize size(tree) + size(misclassifications(tree))
Reduced-Error Pruning
Split data into training and validation set
Do until further pruning is harmful:
Evaluate impact on validation set of pruning each possible node (plus those below it)
Greedily remove the one that most improves validation set accuracy

Produces smallest version of most accurate subtree
What if data is limited?
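A minimal sketch of this greedy loop, assuming a nested-tuple tree representation (a leaf is a class label; an internal node is (attribute, {value: subtree})); as a simplification it tries every class label as the replacement leaf rather than computing the node's majority class:

```python
def classify(tree, x):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_paths(tree, path=()):
    """Yield the branch-value path to every internal node."""
    if isinstance(tree, tuple):
        yield path
        for v, sub in tree[1].items():
            yield from internal_paths(sub, path + (v,))

def replaced(tree, path, leaf):
    """Copy of tree with the node at path replaced by a leaf."""
    if not path:
        return leaf
    attr, branches = tree
    branches = dict(branches)
    branches[path[0]] = replaced(branches[path[0]], path[1:], leaf)
    return (attr, branches)

def prune(tree, val_data, labels):
    """Do until further pruning is harmful: greedily replace the internal
    node whose removal most improves validation-set accuracy."""
    while True:
        best_acc, best_tree = accuracy(tree, val_data), None
        for path in internal_paths(tree):
            for label in labels:
                cand = replaced(tree, path, label)
                if accuracy(cand, val_data) > best_acc:   # strict: stop when
                    best_acc, best_tree = accuracy(cand, val_data), cand
        if best_tree is None:
            return tree
        tree = best_tree
```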
Effect of Reduced-Error Pruning
Rule Post-Pruning
Generate decision tree, then:
Convert tree to equivalent set of rules
Prune each rule independently of others
Sort final rules into desired sequence for use
Perhaps most frequently used method (e.g., C4.5)
Converting a Tree to Rules
If (Outlook = Sunny) ∧ (Humidity = High)
Then PlayTennis = No

If (Outlook = Sunny) ∧ (Humidity = Normal)
Then PlayTennis = Yes
…
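The conversion is a root-to-leaf walk that collects one (attribute, value) test per branch. A sketch, assuming a nested-tuple tree (leaf = class label, internal node = (attribute, {value: subtree})) and a fragment of the PlayTennis tree:

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (preconditions, classification) rules, one per leaf."""
    if not isinstance(tree, tuple):          # leaf: emit one rule
        return [(conditions, tree)]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# A fragment of the PlayTennis tree from the slide:
tree = ("Outlook", {"Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
                    "Overcast": "Yes"})
for pre, cls in tree_to_rules(tree):
    print(" ∧ ".join(f"({a}={v})" for a, v in pre), "=> PlayTennis =", cls)
```

Each rule can then be pruned (preconditions dropped) independently, which is more flexible than pruning the tree itself.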
Continuous Valued Attributes
Create a discrete attribute to test continuous values
Temperature = 82.5
(Temperature > 72.3) = true, false

Temperature: 40  48  60  72  80  90
PlayTennis:  No  No  Yes Yes Yes No
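As the chapter suggests, candidate thresholds sit midway between adjacent sorted values whose class differs, and the one with the highest information gain is kept. A sketch on the Temperature data above:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) for the best boolean test (value > t),
    trying a cut midway between adjacent points whose class differs."""
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                      # class boundaries only
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = (entropy(labels)
             - len(left) / len(labels) * entropy(left)
             - len(right) / len(labels) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best

temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
t, g = best_threshold(temps, play)
print(t)  # prints 54.0 -> new boolean attribute (Temperature > 54)
```

Only two candidate cuts exist here (54 and 85); the gain of the 54 cut is higher, matching the chapter's (Temperature > 54) attribute.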
Attributes with Many Values
Problem:
If attribute has many values, Gain will select it
Imagine using Date = Jun_3_1996 as attribute
One approach: Use GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ – Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|)

where Si is the subset of S for which A has value vi
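The penalty is easy to see numerically: a Date-like attribute that splits 14 examples into 14 singletons has SplitInformation log2(14) ≈ 3.807, while Wind's 8/6 split has SplitInformation ≈ 0.985. A sketch:

```python
import math

def split_information(subset_sizes):
    """-sum (|Si|/|S|) log2 (|Si|/|S|) over the partition induced by A."""
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

print(round(split_information([1] * 14), 3))  # prints 3.807 (= log2 14)
print(round(split_information([8, 6]), 3))    # prints 0.985 (Wind's split)
```

Dividing by SplitInformation leaves an even binary split almost untouched but shrinks the score of a many-valued attribute by a factor of nearly 4 here.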
Attributes with Costs
Consider:
Medical diagnosis: BloodTest has cost $150
Robotics: Width_from_1ft has cost 23 sec.
How to learn a consistent tree with low expected cost?
One approach: replace gain by

Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)

Nunez (1988): (2^Gain(S, A) – 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines importance of cost
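The two replacement measures translate directly into code; a sketch (the function names are mine, not from the slides):

```python
def tan_schlimmer(gain, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# With w = 0 the Nunez measure ignores cost entirely;
# with w = 1 a $9 attribute is penalized by a factor of 10.
print(nunez(1.0, 9.0, 0.0), nunez(1.0, 9.0, 1.0))
```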
Unknown Attribute Values
What if some examples are missing values of A?
Use training example anyway, sort through tree:
If node n tests A, assign most common value of A among other examples sorted to node n
Assign most common value of A among other examples with same target value
Assign probability pi to each possible value vi of A
Assign fraction pi of example to each descendant in tree
Classify new examples in same fashion
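The probability-based option above can be sketched as follows: pi is estimated from the distribution of A's values among the other examples at node n, and a fraction pi of the example's weight goes down each branch (helper names are illustrative):

```python
def value_probabilities(examples_at_node, attr):
    """p_i for each observed value v_i of attr among examples at this node."""
    values = [x[attr] for x in examples_at_node if x.get(attr) is not None]
    return {v: values.count(v) / len(values) for v in set(values)}

def distribute(weight, probabilities):
    """Fraction of a (possibly already fractional) example per branch."""
    return {v: weight * p for v, p in probabilities.items()}

# 3 of 10 sibling examples have Humidity=High, 7 have Humidity=Normal,
# so an example missing Humidity is split 0.3 / 0.7 between the branches.
node_examples = [{"Humidity": "High"}] * 3 + [{"Humidity": "Normal"}] * 7
p = value_probabilities(node_examples, "Humidity")
print(distribute(1.0, p))  # e.g. {'High': 0.3, 'Normal': 0.7}
```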
Summary: Decision-Tree Learning
Most popular symbolic learning method
Learning discrete-valued functions
Information-theoretic heuristic
Handles noisy data
Decision trees completely expressive
Biased towards simpler trees
ID3 → C4.5 → C5.0 (www.rulequest.com)
J48 (WEKA) ≈ C4.5