Decision Tree Learning
1
Decision Tree Learning
Ata Kaban
The University of Birmingham
2
Today we learn about:
- Decision Tree Representation
- Entropy, Information Gain
- ID3 learning algorithm for classification
- Avoiding overfitting
3
Decision Tree Representation for ‘Play Tennis?’
Internal node ~ tests an attribute
Branch ~ an attribute value
Leaf ~ a classification result
4
When is it useful?
- Medical diagnosis
- Equipment diagnosis
- Credit risk analysis
- etc.
6
Sunburn Data Collected

| Name  | Hair   | Height  | Weight  | Lotion | Result    |
|-------|--------|---------|---------|--------|-----------|
| Sarah | Blonde | Average | Light   | No     | Sunburned |
| Dana  | Blonde | Tall    | Average | Yes    | None      |
| Alex  | Brown  | Short   | Average | Yes    | None      |
| Annie | Blonde | Short   | Average | No     | Sunburned |
| Emily | Red    | Average | Heavy   | No     | Sunburned |
| Pete  | Brown  | Tall    | Heavy   | No     | None      |
| John  | Brown  | Average | Heavy   | No     | None      |
| Kate  | Blonde | Short   | Light   | Yes    | None      |
7
Decision Tree 1
[Figure: an ‘is_sunburned’ tree whose root tests Height (short / average / tall); the branches then test Hair colour and Weight, with further Weight and Hair colour tests below. Sarah, Emily and Annie reach ‘sunburned’ leaves; Dana, Alex, Pete, John and Katie do not.]
8
Sunburn sufferers are ...
if height = “average” then
    if weight = “light” then
        return(true)            ;;; Sarah
    elseif weight = “heavy” then
        if hair_colour = “red” then
            return(true)        ;;; Emily
elseif height = “short” then
    if hair_colour = “blonde” then
        if weight = “average” then
            return(true)        ;;; Annie
else return(false)              ;;; everyone else
9
Decision Tree 2
[Figure: an ‘is_sunburned’ tree whose root tests Lotion used (yes / no), with a Hair colour test on each branch. No lotion: blonde gives Sarah and Annie, red gives Emily (sunburned), brown gives Pete and John (not sunburned). Lotion used: blonde gives Dana and Katie, brown gives Alex (not sunburned).]
10
Decision Tree 3
[Figure: an ‘is_sunburned’ tree whose root tests Hair colour. Red gives Emily (sunburned) and brown gives Alex, Pete and John (not sunburned); the blonde branch tests Lotion used: no lotion gives Sarah and Annie (sunburned), lotion gives Dana and Katie (not sunburned).]
11
Summing up
Irrelevant attributes do not classify the data well
Using irrelevant attributes thus causes larger decision trees
a computer could look for simpler decision trees
Q: How?
12
A: How WE did it!
Q: Which is the best attribute for splitting up the data?
A: The one which is most informative for the classification we want to get.
Q: What does it mean ‘more informative’?
A: The attribute which best reduces the uncertainty or the disorder.
Q: How can we measure something like that?
A: Simple – just listen ;-)
13
We need a quantity to measure the disorder in a set of examples
S={s1, s2, s3, …, sn} where s1=“Sarah”, s2=“Dana”, …
Then we need a quantity to measure how much the disorder is reduced once we know the value of a particular attribute
14
What properties should the Disorder (D) have?
Suppose that D(S)=0 means that all the examples in S have the same class
Suppose that D(S)=1 means that half the examples in S are of one class and half are of the opposite class
15
Examples
D({“Dana”, “Pete”}) = 0
D({“Sarah”, “Annie”, “Emily”}) = 0
D({“Sarah”, “Emily”, “Alex”, “John”}) = 1
D({“Sarah”, “Emily”, “Alex”}) = ?
16
D({“Sarah”, “Emily”, “Alex”}) = 0.918

[Plot: disorder D against the proportion of positive examples p+; D rises from 0 at p+ = 0 to 1.0 at p+ = 0.5 and falls back to 0 at p+ = 1.0. The set above has p+ = 0.67, giving D = 0.918.]
17
Definition of Disorder
The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative, and it is given by

$$D(n_+, n_-) = \mathrm{Entropy}(S) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n}$$

where $\log_2 x$ means the logarithm of x to base 2.
Check it!
D(0,1) = ? D(1,0)=? D(0.5,0.5)=?
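As a quick way to check these, here is a minimal Python sketch of the disorder measure defined above (the function name `disorder` is my own assumption, not from the slides):

```python
import math

def disorder(n_pos, n_neg):
    """Entropy-based disorder D(n+, n-) of a set with n_pos positive
    and n_neg negative examples, measured in bits."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    d = 0.0
    for k in (n_pos, n_neg):
        p = k / n
        if p > 0:                      # convention: 0 * log2(0) = 0
            d -= p * math.log2(p)
    return d

print(disorder(0, 1))                  # 0.0 -> all examples in one class
print(disorder(1, 0))                  # 0.0
print(disorder(1, 1))                  # 1.0 -> half positive, half negative
print(round(disorder(2, 1), 3))        # 0.918 -> e.g. {Sarah, Emily, Alex}
```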
18
Back to the beach (or the disorder of sunbathers)!
D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Katie”})

$$= D(3,5) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.954$$
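With the `disorder` sketch given under slide 17, `round(disorder(3, 5), 3)` reproduces this value of 0.954.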
19
Some more useful properties of the Entropy
$$D(n, m) = D(m, n)$$

$$D(0, m) = 0$$

$$D(m, m) = 1$$
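These properties can be spot-checked with the hypothetical `disorder` function sketched under slide 17 (run that block first):

```python
assert disorder(4, 2) == disorder(2, 4)   # D(n, m) = D(m, n)
assert disorder(0, 7) == 0.0              # D(0, m) = 0
assert disorder(5, 5) == 1.0              # D(m, m) = 1
```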
20
So: we can measure the disorder. What’s left:
- We want to measure by how much the disorder of a set would be reduced if we knew the value of a particular attribute.
21
The Information Gain measures the expected reduction in entropy due to splitting on an attribute A
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

The average disorder is just the weighted sum of the disorders in the branches (subsets) created by the values of A.

We want:
- a large Gain
- which is the same as: a small average disorder created
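A minimal Python sketch of this quantity, assuming each example is stored as a dict of attribute values with a parallel list of class labels (the names `rows`, `labels` and `information_gain` are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Disorder of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
    n = len(labels)
    avg_disorder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        avg_disorder += (len(subset) / n) * entropy(subset)   # |S_v|/|S| * Entropy(S_v)
    return entropy(labels) - avg_disorder
```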
22
Back to the beach: calculate the average disorder associated with Hair Colour

$$\frac{|S_{blonde}|}{|S|}\,D(S_{blonde}) + \frac{|S_{red}|}{|S|}\,D(S_{red}) + \frac{|S_{brown}|}{|S|}\,D(S_{brown})$$

[Figure: splitting on Hair colour. Blonde: Sarah, Annie, Dana, Katie; red: Emily; brown: Alex, Pete, John.]
23
…Calculating the Disorder of the “blondes”
The first term of the sum:

$$\frac{|S_{blonde}|}{|S|}\,D(S_{blonde}) = \frac{4}{8}\,D(\{\text{“Sarah”, “Annie”, “Dana”, “Katie”}\}) = \frac{4}{8}\,D(2,2) = \frac{4}{8}\times 1 = 0.5$$
24
…Calculating the disorder of the others
The second and third terms of the sum: $S_{red}$ = {“Emily”} and $S_{brown}$ = {“Alex”, “Pete”, “John”}.
These are both 0 because within each set all the examples have the same class
So the avg disorder created when splitting on ‘hair colour’ is 0.5+0+0=0.5
25
Which decision variable minimises the disorder?

| Test   | Avg disorder                        |
|--------|-------------------------------------|
| Hair   | 0.5 (the value we just computed)    |
| Height | 0.69                                |
| Weight | 0.94                                |
| Lotion | 0.61                                |

These are the avg disorders of the other attributes, computed in the same way.

Which decision variable maximises the Info Gain then? Remember, it’s the one which minimises the avg disorder (see slide 21 for a refresher).
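For concreteness, here is a small self-contained Python check of these numbers on the sunburn table from slide 6 (the data encoding and function names are my own):

```python
import math
from collections import Counter

# Sunburn data from slide 6: (hair, height, weight, lotion, result)
data = [
    ("blonde", "average", "light",   "no",  "sunburned"),
    ("blonde", "tall",    "average", "yes", "none"),
    ("brown",  "short",   "average", "yes", "none"),
    ("blonde", "short",   "average", "no",  "sunburned"),
    ("red",    "average", "heavy",   "no",  "sunburned"),
    ("brown",  "tall",    "heavy",   "no",  "none"),
    ("brown",  "average", "heavy",   "no",  "none"),
    ("blonde", "short",   "light",   "yes", "none"),
]
attributes = {"hair": 0, "height": 1, "weight": 2, "lotion": 3}

def disorder(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def avg_disorder(col):
    """Weighted disorder of the subsets created by splitting on one column."""
    n = len(data)
    total = 0.0
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        total += (len(subset) / n) * disorder(subset)
    return total

for name, col in attributes.items():
    print(f"{name:6s} avg disorder = {avg_disorder(col):.2f}")
# hair 0.50, height 0.69, weight 0.94, lotion 0.61 -> split on hair colour first
```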
26
So what is the best decision tree?
[Figure: the tree so far. The root tests Hair colour: red gives Emily (sunburned), brown gives Alex, Pete and John (not sunburned), and the blonde branch (Sarah, Annie, Dana, Katie) is still mixed and marked ‘?’.]
27
ID3 algorithm
Greedy search in the hypothesis space
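A compact Python sketch of this greedy idea (not the lecture's exact pseudocode): at every node pick the attribute with the smallest average disorder, i.e. the largest information gain, split on it and recurse. The data layout (rows as dicts of attribute values, a parallel list of labels) and all names are my own assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """Greedy top-down construction of a decision tree (no pruning)."""
    if len(set(labels)) == 1:                 # pure node: return a leaf
        return labels[0]
    if not attributes:                        # no tests left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def avg_disorder(attr):                   # weighted entropy after splitting on attr
        total = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            total += (len(subset) / len(labels)) * entropy(subset)
        return total

    best = min(attributes, key=avg_disorder)  # i.e. the attribute with the largest gain
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree
```

On the sunburn data this should reproduce Decision Tree 3 from slide 10: a hair-colour test at the root with a lotion test under the blonde branch.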
28
Is this all? Is it really that simple?
Of course not… Where do we stop growing the tree? What if there are also noisy (mislabelled) examples in the data set?
29
Overfitting in Decision Tree Learning
30
Overfitting
Consider the error of hypothesis h over:
- the training data: error_train(h)
- the whole data set (new data as well): error_D(h)

If there is another hypothesis h’ such that error_train(h) < error_train(h’) and error_D(h) > error_D(h’), then we say that hypothesis h overfits the training data.
31
How can we avoid overfitting?
- Split the data into a training set and a validation set
- Train on the training set and stop growing the tree when further splitting deteriorates performance on the validation set
- Or: grow the full tree first and then post-prune (see the sketch below)
- What if data is limited?
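As an illustration of the grow-then-prune idea, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (a CART-style learner rather than ID3) and synthetic data: it grows a full tree, then uses a held-out validation set to pick how heavily to prune via cost-complexity pruning. Everything here (library choice, data, parameters) is an illustrative assumption, not the lecture's method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the sunburn table is far too small for this demo)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow the full tree, then enumerate the candidate pruning levels
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)            # guard against tiny negative round-off
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)        # accuracy on the validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_score:.3f}")
```

When data is limited, the same selection can be done with cross-validation instead of a single held-out split.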
32
…looks a bit better now
33
Summary
- Decision Tree Representation
- Entropy, Information Gain
- ID3 learning algorithm
- Overfitting and how to avoid it
34
When to consider Decision Trees?
- Data is described by a finite number of attributes, each with a finite number of possible values
- The target function is discrete valued (i.e. a classification problem)
- Possibly noisy data
- Possibly missing values
- E.g.: medical diagnosis, equipment diagnosis, credit risk analysis, etc.