2013-1 Machine Learning Lecture 02 - Decision Trees
![Page 1: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/1.jpg)
Decision Tree Dae-Ki Kang
![Page 2: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/2.jpg)
Definition
• Definition #1
▫ A hierarchy of if-then’s
▫ Node – test
▫ Edge – direction of control
• Definition #2
▫ A tree that represents compression of data based on class
• A manually generated decision tree is not interesting at all!
![Page 3: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/3.jpg)
Decision tree for mushroom data
![Page 4: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/4.jpg)
Algorithms
• ID3
▫ Information gain
• C4.5 (=J48 in WEKA) (and See5/C5.0)
▫ Information gain ratio
• Classification and regression tree (CART)
▫ Gini gain
• Chi-squared automatic interaction detection (CHAID)
![Page 5: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/5.jpg)
Example from Tom Mitchell’s book
![Page 6: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/6.jpg)
Naïve strategy of choosing attributes
(i.e. choose the next available attribute)
• Outlook (root): Play=Yes 3,4,5,7,9,10,11,12,13 (9); Play=No 1,2,6,8,14 (5)
▫ Sunny: Play=Yes 9,11 (2); Play=No 1,2,8 (3); next test: Temp (Hot, Mild, Cool)
▫ Overcast: Play=Yes 3,7,12,13 (4); Play=No (0); leaf: Play=Y
▫ Rain: Play=Yes 4,5,10 (3); Play=No 6,14 (2); next test: Temp (Hot, Mild, Cool)
![Page 7: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/7.jpg)
How to generate decision trees?
• Optimal one
▫ Finding an optimal tree is NP-hard
• Greedy one
▫ Greedy means big questions first
▫ Strategy – divide and conquer
▫ Choose an easy-to-understand test such that the sub-data sets it produces are the easiest to deal with
▫ Usually choose an attribute as the test
▫ Usually adopt an impurity measure to see how easy the sub-data sets are to deal with
• Are there any other approaches? – There are many, and the question is still open
![Page 8: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/8.jpg)
Impurity criteria
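• Entropy → Information Gain, Information Gain Ratio
▫ Most popular
▫ Entropy(S) = -Sum of p_i * log(p_i) over the class proportions p_i
▫ IG = Entropy(S) - Sum of |t|/|S| * Entropy(t) over the sub-data sets t
▫ IG favors attributes with many distinct values, such as a Social Security Number or ID
▫ Information Gain Ratio divides IG by the entropy of the split itself to counter this bias
• Gini index → Gini Gain (used in CART)
▫ Related to the Area Under the Curve
▫ Gini index = 1 - Sum of fractions^2; Gini Gain is the drop in the size-weighted Gini index after a split
• Misclassification rate
▫ (misclassified instances)/(all instances)
▫ Problematic – leads to many indistinguishable splits (where other splits are more desirable)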
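The remark about indistinguishable splits can be checked numerically. The sketch below compares the three criteria on a commonly used textbook illustration (a node with 400 instances of each class and two candidate splits); the counts and function names are assumptions chosen for illustration and are not taken from the lecture.

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def misclassification(counts):
    return 1.0 - max(counts) / sum(counts)


def weighted_impurity(impurity, children):
    """Size-weighted average impurity of the sub-data sets produced by a split."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)


# Each child is (count of class 1, count of class 2).
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]  # produces one pure child

for name, impurity in [("misclassification", misclassification),
                       ("gini", gini), ("entropy", entropy)]:
    print(name,
          round(weighted_impurity(impurity, split_a), 3),
          round(weighted_impurity(impurity, split_b), 3))
```

Misclassification rate scores both splits as 0.25, while Gini and entropy both score split B lower (better), which is usually the more desirable choice because it isolates a pure subset.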
![Page 9: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/9.jpg)
Using IG for choosing attributes
• Outlook (root): Play=Yes 3,4,5,7,9,10,11,12,13 (9); Play=No 1,2,6,8,14 (5)
▫ Sunny: Play=Yes 9,11 (2); Play=No 1,2,8 (3)
▫ Overcast: Play=Yes 3,7,12,13 (4); Play=No (0)
▫ Rain: Play=Yes 4,5,10 (3); Play=No 6,14 (2)
IG(S) = Entropy(S) – Sum(|S_i|/|S|*Entropy(S_i))
IG(Outlook) = Entropy(Outlook) - |Sunny|/|Outlook|*Entropy(Sunny) - |Overcast|/|Outlook|*Entropy(Overcast) - |Rain|/|Outlook|*Entropy(Rain)
Entropy(Outlook) = -(9/14)*log(9/14) - (5/14)*log(5/14)
|Sunny|/|Outlook|*Entropy(Sunny) = 5/14*(-(2/5)*log(2/5) - (3/5)*log(3/5))
|Overcast|/|Outlook|*Entropy(Overcast) = 4/14*(-(4/4)*log(4/4) - (0/4)*log(0/4))
|Rain|/|Outlook|*Entropy(Rain) = 5/14*(-(3/5)*log(3/5) - (2/5)*log(2/5))
Here Entropy(Outlook) and |Outlook| refer to all 14 instances that reach the Outlook node, and 0*log(0) is taken to be 0.
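The arithmetic on this slide can be reproduced with a few lines of code. This is a minimal sketch using base-2 logarithms and the class counts listed above; the function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class-count tuple; zero counts contribute nothing."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def information_gain(parent, children):
    """IG = Entropy(S) - Sum(|S_i|/|S| * Entropy(S_i))."""
    n = sum(parent)
    remainder = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - remainder


parent = (9, 5)  # (Play=Yes, Play=No) over all 14 instances
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

ig = information_gain(parent, outlook.values())
print(f"IG(Outlook) = {ig:.3f}")  # prints IG(Outlook) = 0.247
```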
![Page 10: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/10.jpg)
Zero Occurrence
• When a feature value never occurs in the training set, its frequency is zero. PANIC: this makes all terms zero
• Smoothing the distribution
▫ Laplacian Smoothing
▫ Dirichlet Priors Smoothing
▫ and many more (Absolute Discounting, Jelinek-Mercer smoothing, Katz smoothing, Good-Turing smoothing, etc.)
![Page 11: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/11.jpg)
Calculating IG with Laplacian smoothing
• Outlook (root): Play=Yes 3,4,5,7,9,10,11,12,13 (9); Play=No 1,2,6,8,14 (5)
▫ Sunny: Play=Yes 9,11 (2); Play=No 1,2,8 (3)
▫ Overcast: Play=Yes 3,7,12,13 (4); Play=No (0)
▫ Rain: Play=Yes 4,5,10 (3); Play=No 6,14 (2)
IG(S) = Entropy(S) – Sum(|S_i|/|S|*Entropy(S_i))
IG(Outlook) = Entropy(Outlook) - |Sunny|/|Outlook|*Entropy(Sunny) - |Overcast|/|Outlook|*Entropy(Overcast) - |Rain|/|Outlook|*Entropy(Rain)
Entropy(Outlook) = -(10/16)*log(10/16) - (6/16)*log(6/16)
|Sunny|/|Outlook|*Entropy(Sunny) = 6/17*(-(3/7)*log(3/7) - (4/7)*log(4/7))
|Overcast|/|Outlook|*Entropy(Overcast) = 5/17*(-(5/6)*log(5/6) - (1/6)*log(1/6))
|Rain|/|Outlook|*Entropy(Rain) = 6/17*(-(4/7)*log(4/7) - (3/7)*log(3/7))
Each count is increased by 1: class counts (two classes) give denominators such as 14+2 = 16 and 5+2 = 7, and branch sizes (three branches) give 14+3 = 17.
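The smoothed fractions on this slide follow from adding one to every count before normalizing. A minimal sketch of that calculation, assuming add-one smoothing is applied both to the class counts and to the branch sizes (which is what the denominators 16, 7, 6, and 17 suggest); the helper names are illustrative.

```python
from math import log2


def smoothed_probs(counts, alpha=1):
    """Laplacian (add-alpha) smoothing: (count + alpha) / (total + alpha * k)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)


parent = (9, 5)  # (Play=Yes, Play=No)
branches = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

# Branch weights are smoothed as well: 6/17, 5/17, 6/17.
weights = smoothed_probs([sum(counts) for counts in branches.values()])
remainder = sum(w * entropy(smoothed_probs(counts))
                for w, counts in zip(weights, branches.values()))
ig = entropy(smoothed_probs(parent)) - remainder
print(f"smoothed IG(Outlook) = {ig:.3f}")
```

Note that the Overcast branch no longer has a zero probability, so no term vanishes; the smoothed gain comes out noticeably smaller than the unsmoothed one on this small data set.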
![Page 12: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/12.jpg)
Overfitting
• Training set error
▫ Error of the classifier on the training data
▫ It is a bad idea to use up all data for training. You will be out of data to evaluate the learning algorithm.
• Test set error
▫ Error of the classifier on the test data
▫ Jackknife – Use n-1 examples to learn and 1 to test. Repeat n times.
▫ x-fold stratified cross-validation – Divide the data into x folds with the same class proportions. Use x-1 folds to train and 1 fold to test. Repeat x times.
• Overfitting
▫ The input data is incomplete (Quine)
▫ The input data do not reflect all possible cases.
▫ The input data can include noise.
▫ I.e., fitting the classifier tightly to the input data is a bad idea.
• Occam’s razor
▫ Old axiom used to prove the existence of God.
▫ “Plurality should not be posited without necessity”
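The stratified cross-validation and jackknife procedures described on this slide can be sketched as follows. This is one simple way to build the folds (dealing the indices of each class round-robin); it does not shuffle, which a practical implementation normally would, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Return k lists of indices with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:  # deal each class out across the folds
            folds[position % k].append(index)
            position += 1
    return folds


def train_test_splits(labels, k):
    """Yield (train indices, test indices): k-1 folds to train, 1 fold to test."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Class labels of instances 1..14 from the PlayTennis example above.
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

for train, test in train_test_splits(labels, 5):
    print(len(train), sorted(test))
```

Setting `k = len(labels)` gives the jackknife: every fold holds exactly one example.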
![Page 13: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/13.jpg)
Razors and Canon
• Occam's razor (Ockham's razor)
▫ "Plurality is not to be posited without necessity"
▫ Similar to a principle of parsimony
▫ If two hypotheses have almost equal predictive power, we prefer the more concise one.
• Hanlon's razor
▫ Never attribute to malice that which is adequately explained by stupidity.
• Morgan's Canon
▫ In no case is an animal activity to be interpreted in terms of higher psychological processes if it can be fairly interpreted in terms of processes which stand lower in the scale of psychological evolution and development.
![Page 14: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/14.jpg)
Example: Playing Tennis
(taken from Andrew Moore’s slides)
Humidity (h = High, n = Norm): (9+, 5-) splits into High (3+, 4-) and Norm (6+, 1-)

$$I_h = P(h,p)\log\frac{P(h,p)}{P(h)P(p)} + P(n,p)\log\frac{P(n,p)}{P(n)P(p)} + P(h,\neg p)\log\frac{P(h,\neg p)}{P(h)P(\neg p)} + P(n,\neg p)\log\frac{P(n,\neg p)}{P(n)P(\neg p)} = 0.151$$

Wind (w = Weak, s = Strong): (9+, 5-) splits into Weak (6+, 2-) and Strong (3+, 3-)

$$I_w = P(w,p)\log\frac{P(w,p)}{P(w)P(p)} + P(s,p)\log\frac{P(s,p)}{P(s)P(p)} + P(w,\neg p)\log\frac{P(w,\neg p)}{P(w)P(\neg p)} + P(s,\neg p)\log\frac{P(s,\neg p)}{P(s)P(\neg p)} = 0.048$$
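The two values above can be verified by computing the mutual information between each attribute and the class directly from the (+, -) counts. A minimal sketch, assuming base-2 logarithms; `mutual_information` is an illustrative name.

```python
from math import log2


def mutual_information(table):
    """table[value] = (positives, negatives) for each value of the attribute."""
    n = sum(pos + neg for pos, neg in table.values())
    class_totals = [sum(row[c] for row in table.values()) for c in (0, 1)]
    mi = 0.0
    for row in table.values():
        p_value = sum(row) / n
        for c in (0, 1):
            joint = row[c] / n
            if joint > 0:  # a zero joint probability contributes nothing
                mi += joint * log2(joint / (p_value * class_totals[c] / n))
    return mi


humidity = {"High": (3, 4), "Norm": (6, 1)}
wind = {"Weak": (6, 2), "Strong": (3, 3)}

print(round(mutual_information(humidity), 3))  # 0.152
print(round(mutual_information(wind), 3))      # 0.048
```

The Humidity value rounds to 0.152; the 0.151 on the slide appears to be truncated rather than rounded. Either way Humidity carries more information about the class than Wind, and the numbers coincide with the information gain of each split.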
![Page 15: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/15.jpg)
Prediction for Nodes
From Andrew Moore’s slides
What is the prediction for each node?
![Page 16: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/16.jpg)
Prediction for Nodes
![Page 17: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/17.jpg)
Recursively Growing Trees
Original Dataset
Partition it according to the value of the attribute we split on
cylinders = 4
cylinders = 5
cylinders = 6
cylinders = 8
From Andrew Moore's slides
![Page 18: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/18.jpg)
Recursively Growing Trees
• cylinders = 4: build tree from these records
• cylinders = 5: build tree from these records
• cylinders = 6: build tree from these records
• cylinders = 8: build tree from these records
From Andrew Moore's slides
![Page 19: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/19.jpg)
A Two Level Tree
Recursively growing trees
![Page 20: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/20.jpg)
When Should We Stop Growing Trees?
Should we split this node?
![Page 21: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/21.jpg)
Base Cases
• Base Case One: If all records in current data subset have the same output then don’t recurse
• Base Case Two: If all records have exactly the same set of input attributes then don’t recurse
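Recursive growth with these two base cases can be sketched as an ID3-style procedure. This is a minimal sketch under stated assumptions: attributes are categorical, the split is chosen by information gain, rows are dictionaries, and the toy records at the bottom are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def grow(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Base case one: all records have the same output.
    if len(set(labels)) == 1:
        return majority
    # Base case two: all records have the same inputs (or no attribute is left).
    if not attributes or all(row == rows[0] for row in rows):
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in set(row[best] for row in rows):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        branches[value] = grow([r for r, _ in pairs],
                               [y for _, y in pairs], remaining)
    return (best, branches)


# Made-up toy records, for illustration only.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Overcast", "Wind": "Weak"},
    {"Outlook": "Rain", "Wind": "Weak"},
    {"Outlook": "Rain", "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "Yes", "No"]
print(grow(rows, labels, ["Outlook", "Wind"]))
```

When a base case fires, the node becomes a leaf that predicts the majority class of the records that reached it.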
![Page 22: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/22.jpg)
Base Cases: An idea
• Base Case One: If all records in current data subset have the same output then don’t recurse
• Base Case Two: If all records have exactly the same set of input attributes then don’t recurse
Proposed Base Case 3:
If all attributes have zero information gain then don’t recurse
Is this a good idea?
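One case worth testing against the proposed third base case is the XOR function, y = a XOR b. The sketch below (an illustrative check, not part of the original slides) shows that neither attribute has any information gain at the root, even though splitting on one and then the other classifies every record correctly.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


rows = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1)]
labels = [row["a"] ^ row["b"] for row in rows]

# At the root, both attributes have zero information gain.
root_gains = {attr: information_gain(rows, labels, attr) for attr in ("a", "b")}
print(root_gains)  # {'a': 0.0, 'b': 0.0}

# After splitting on a, attribute b separates the classes perfectly.
for a_value in (0, 1):
    pairs = [(r, y) for r, y in zip(rows, labels) if r["a"] == a_value]
    gain = information_gain([r for r, _ in pairs], [y for _, y in pairs], "b")
    print(a_value, gain)  # gain is 1.0 in both halves
```

Stopping at the root here would leave a tree with 50% error, which suggests the proposed base case is too eager.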
![Page 23: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/23.jpg)
Old Topic: Overfitting
![Page 24: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/24.jpg)
Pruning
• Prepruning (=forward pruning)
• Postpruning (=backward pruning)
▫ Reduced error pruning
▫ Rule post-pruning
![Page 25: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/25.jpg)
Pruning Decision Tree
• Prepruning: stop growing the tree in time
• Postpruning: build the full decision tree as before.
• But when you can grow it no more, start to prune:
▫ Reduced error pruning
▫ Rule post-pruning
![Page 26: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/26.jpg)
Reduced Error Pruning
• Split data into training and validation set
• Build a full decision tree over the training set
• Repeatedly remove the node whose removal most increases validation set accuracy, and stop when further pruning would reduce it
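The pruning loop can be sketched as follows. This is a simplified illustration, not a reference implementation: the dictionary encoding (each internal node stores its test attribute, its branches, and the majority class of its training records) is an assumption of this sketch, and the tiny tree and validation set are made up.

```python
import copy


def classify(node, row):
    while isinstance(node, dict):
        node = node["branches"].get(row[node["attr"]], node["majority"])
    return node


def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


def internal_paths(node, path=()):
    """Yield the branch-value path leading to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_paths(child, path + (value,))


def replace_with_leaf(tree, path):
    """Copy the tree and collapse the node at `path` into its majority label."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    parent = tree
    for value in path[:-1]:
        parent = parent["branches"][value]
    parent["branches"][path[-1]] = parent["branches"][path[-1]]["majority"]
    return tree


def reduced_error_prune(tree, rows, labels):
    while isinstance(tree, dict):
        candidates = [replace_with_leaf(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, rows, labels))
        if accuracy(best, rows, labels) < accuracy(tree, rows, labels):
            break  # further pruning is harmful
        tree = best
    return tree


# Made-up tree with a spurious split on Temp under Rain.
tree = {"attr": "Outlook", "majority": "Yes", "branches": {
    "Overcast": "Yes",
    "Sunny": "No",
    "Rain": {"attr": "Temp", "majority": "Yes",
             "branches": {"Mild": "Yes", "Cool": "No"}},
}}
val_rows = [{"Outlook": "Rain", "Temp": "Cool"}, {"Outlook": "Rain", "Temp": "Mild"},
            {"Outlook": "Sunny", "Temp": "Mild"}, {"Outlook": "Overcast", "Temp": "Cool"}]
val_labels = ["Yes", "Yes", "No", "Yes"]

pruned = reduced_error_prune(tree, val_rows, val_labels)
print(pruned["branches"]["Rain"], accuracy(pruned, val_rows, val_labels))
```

Here the Temp subtree is collapsed because doing so raises validation accuracy, while collapsing the root would lower it, so pruning stops.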
![Page 27: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/27.jpg)
Original Decision Tree
![Page 28: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/28.jpg)
Pruned Decision Tree
![Page 29: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/29.jpg)
Reduced Error Pruning
![Page 30: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/30.jpg)
Rule Post-Pruning
• Convert tree into rules
• Prune rules by removing the preconditions
• Sort final rules by their estimated accuracy
Most widely used method (e.g., C4.5)
Other methods: statistical significance test (chi-square)
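The three steps of rule post-pruning can be sketched in code. This is a simplified sketch: rule accuracy is estimated directly on whatever data is passed in, whereas C4.5 uses a pessimistic estimate; the tuple encoding of trees and rules, the function names, and the toy records are assumptions made for illustration.

```python
def tree_to_rules(node, conditions=()):
    """One rule per root-to-leaf path: (preconditions, predicted label)."""
    if not isinstance(node, tuple):
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules


def rule_accuracy(rule, rows, labels):
    conditions, predicted = rule
    covered = [y for r, y in zip(rows, labels)
               if all(r[a] == v for a, v in conditions)]
    return sum(y == predicted for y in covered) / len(covered) if covered else 0.0


def prune_rule(rule, rows, labels):
    """Drop preconditions one at a time while estimated accuracy does not fall."""
    conditions, predicted = rule
    improved = True
    while improved and conditions:
        improved = False
        current = rule_accuracy((conditions, predicted), rows, labels)
        for cond in conditions:
            shorter = tuple(c for c in conditions if c != cond)
            if rule_accuracy((shorter, predicted), rows, labels) >= current:
                conditions, improved = shorter, True
                break
    return conditions, predicted


# Made-up tree and records, for illustration only.
tree = ("Outlook", {"Sunny": "No",
                    "Rain": ("Wind", {"Weak": "Yes", "Strong": "Yes"})})
rows = [{"Outlook": "Rain", "Wind": "Weak"},
        {"Outlook": "Rain", "Wind": "Strong"},
        {"Outlook": "Sunny", "Wind": "Weak"}]
labels = ["Yes", "Yes", "No"]

rules = [prune_rule(rule, rows, labels) for rule in tree_to_rules(tree)]
rules.sort(key=lambda rule: rule_accuracy(rule, rows, labels), reverse=True)
for conditions, predicted in rules:
    print(conditions, "->", predicted)
```

Because each rule is pruned independently, a precondition can be dropped from one path without affecting the others, which is not possible when pruning the tree itself.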
![Page 31: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/31.jpg)
Real Value Inputs
• What should we do to deal with real value inputs?
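| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | maker |
|---|---|---|---|---|---|---|---|
| good | 4 | 97 | 75 | 2265 | 18.2 | 77 | asia |
| bad | 6 | 199 | 90 | 2648 | 15 | 70 | america |
| bad | 4 | 121 | 110 | 2600 | 12.8 | 77 | europe |
| bad | 8 | 350 | 175 | 4100 | 13 | 73 | america |
| bad | 6 | 198 | 95 | 3102 | 16.5 | 74 | america |
| bad | 4 | 108 | 94 | 2379 | 16.5 | 73 | asia |
| bad | 4 | 113 | 95 | 2228 | 14 | 71 | asia |
| bad | 8 | 302 | 139 | 3570 | 12.8 | 78 | america |
| : | : | : | : | : | : | : | : |
| : | : | : | : | : | : | : | : |
| : | : | : | : | : | : | : | : |
| good | 4 | 120 | 79 | 2625 | 18.6 | 82 | america |
| bad | 8 | 455 | 225 | 4425 | 10 | 70 | america |
| good | 4 | 107 | 86 | 2464 | 15.5 | 76 | europe |
| bad | 5 | 131 | 103 | 2830 | 15.9 | 78 | europe |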
![Page 32: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/32.jpg)
Information Gain
• x: a real value input
• t: split value
• Find the split value t such that the mutual information I(x, y: t) between x and the class label y is maximized.
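The search for the split value t can be sketched as a scan over the midpoints between consecutive distinct values of x. The sketch below assumes a binary test x < t and uses information gain, which equals the mutual information between the thresholded input and the class label; the data are the horsepower and mpg values of the twelve records visible in the table on the previous slide.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (t, gain) maximizing the information gain of the test x < t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for (x0, _), (x1, _) in zip(pairs, pairs[1:]):
        if x0 == x1:
            continue  # no threshold fits between equal values
        t = (x0 + x1) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_t = base - remainder, t
    return best_t, best_gain


horsepower = [75, 90, 110, 175, 95, 94, 95, 139, 79, 225, 86, 103]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad",
       "good", "bad", "good", "bad"]

t, gain = best_split(horsepower, mpg)
print(t, round(gain, 3))  # 88.0 0.811
```

On these twelve records a threshold of 88 separates the classes completely, so the gain equals the full entropy of the labels; on the complete data set the best threshold would generally differ.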
![Page 33: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/33.jpg)
Pros and Cons
• Pros
▫ Easy to understand
▫ Fast learning algorithms (because they are greedy)
▫ Robust to noise
▫ Good accuracy
• Cons
▫ Unstable
▫ Hard to represent some functions (Parity, XOR, etc.)
▫ Duplication in subtrees
▫ Cannot be used to express all first order logic because the test cannot refer to two or more different objects
![Page 34: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/34.jpg)
Generation of data from a decision tree (based on the definition #2)
• Decision tree with support for each node → Rule set
▫ support = # of training instances assigned to a node
• Rule set → Instances
• In this way, one can combine multiple decision trees by combining rule sets
• cf. Bayesian classifiers → Fractional instances
![Page 35: 2013-1 Machine Learning Lecture 02 - Decision Trees](https://reader033.fdocuments.us/reader033/viewer/2022042714/556879c1d8b42a3b7b8b50bf/html5/thumbnails/35.jpg)
Extensions and further considerations
• Extensions
▫ Alternating decision tree
▫ Naïve Bayes Tree
▫ Attribute Value Taxonomy guided Decision Tree
▫ Recursive Naïve Bayes
▫ and many more
• Further Research
▫ Decision graph
▫ Bottom up generation of decision tree
▫ Evolutionary construction of decision tree
▫ Integrating two decision trees
▫ and many more