Decision Trees
References:
• "Artificial Intelligence: A Modern Approach, 3rd ed" (Pearson) 18.3-18.4
• http://onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html
• http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree_overfitting.htm
What are they?
• A "flowchart" of logic
• Example:
  – If my health is low:
    • run to cover
  – Else:
    • if an enemy is nearby:
      – shoot it
    • else:
      – scavenge for treasure
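The nested if/else above maps directly onto code. A minimal sketch (the predicate names and action strings are invented for illustration):

```python
def choose_action(health_is_low, enemy_nearby):
    """Walk the 'flowchart' above and return the chosen action."""
    if health_is_low:
        return "run to cover"
    else:
        if enemy_nearby:
            return "shoot it"
        else:
            return "scavenge for treasure"
```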
Another Example
• Goal: Decide if we'll wait for a table at a restaurant
• Factors:
  – Alternate: Is there another restaurant nearby?
  – Bar: Does the restaurant have a bar?
  – Fri/Sat: Is it a Friday or Saturday?
  – Hungry: Are we hungry?
  – Patrons: How many people? {None, Some, Full}
  – Price: Price range {$, $$, $$$}
  – Raining: Is it raining?
  – Reservation: Do we have a reservation?
  – Type: {French, Italian, Thai, Burger}
  – Wait: Estimated wait in minutes {0-10, 10-30, 30-60, >60}
Possible decision tree

[Figure: one possible decision tree for this problem. The root tests Patrons (None → N, Some → Y, Full → a Wait test); the Wait branches (>60, 30-60, 10-30, 0-10) lead into further tests on Alternate, Hungry, Reservation, Fri/Sat, Bar, and Raining, each bottoming out in Y/N leaves.]
Analysis
• Pluses:
  – Easy to traverse
  – Naturally expressed as if/elses
• Negatives:
  – How do we build an optimal tree?
Sample Input

#   Alt  Bar  Fri  Hun  Pat  Pr   Ran  Res  Type  Wait   ??
1   Y    N    N    Y    S    $$$  N    Y    Fr    0-10   Y
2   Y    N    N    Y    F    $    N    N    Th    30-60  N
3   N    Y    N    N    S    $    N    N    Bu    0-10   Y
4   Y    N    Y    Y    F    $    Y    N    Th    10-30  Y
5   Y    N    Y    N    F    $$$  N    Y    Fr    >60    N
6   N    Y    N    Y    S    $$   Y    Y    It    0-10   Y
7   N    Y    N    N    N    $    Y    N    Bu    0-10   N
8   N    N    N    Y    S    $$   Y    Y    Th    0-10   Y
9   N    Y    Y    N    F    $    Y    N    Bu    >60    N
10  Y    Y    Y    Y    F    $$$  N    Y    It    10-30  N
11  N    N    N    N    N    $    N    N    Th    0-10   N
12  Y    Y    Y    Y    F    $    N    N    Bu    30-60  Y
Sample Input, cont
• We can also think of these as "training data"
  – For a decision tree we want to model
  – In this context, the input:
    • is that of "experts"
    • exemplifies the thinking you want to encode
    • is raw data we want to mine
    • …
• Note:
  – It doesn't contain all possibilities
  – There might be noise
Building a tree
• So how do we build a decision tree from input?
• There are a lot of possible trees:
  – O(2^n)
  – Some are good, some are bad:
    • good == shallowest
    • bad == deepest
  – Intractable to find the best
• Using a greedy algorithm, we can find a pretty good one…
ID3 algorithm
• By Ross Quinlan (RuleQuest Research)
• Basic idea:
  – Choose the best attribute, i
  – Create a tree with n children
    • n is the number of values for attribute i
  – Divide the training set into n subsets
    • where all items in a subset have the same value for attribute i
    • If all items in the subset have the same output value, make this a leaf node.
    • If not, recursively create a new subtree:
      – Only use the training examples in this subset
      – Don't consider attribute i any more.
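The recursive procedure above can be sketched in Python. "Best attribute" here is measured by entropy, which the next slides define; all function and variable names are my own, not from ID3 itself:

```python
import math
from collections import Counter

def entropy(labels):
    """H of a list of output values (e.g. ['Y', 'Y', 'N'])."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def id3(cases, labels, attrs):
    """cases: list of {attribute: value} dicts; labels: parallel outputs.
    Returns a leaf label, or an (attribute, {value: subtree}) node."""
    if len(set(labels)) == 1:              # all outputs agree -> leaf node
        return labels[0]
    if not attrs:                          # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def remainder(a):                      # weighted entropy after splitting on a
        total = 0.0
        for v in set(c[a] for c in cases):
            sub = [l for c, l in zip(cases, labels) if c[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(attrs, key=remainder)       # lowest remainder == highest gain
    children = {}
    for v in set(c[best] for c in cases):  # one child per value of `best`
        sub_c = [c for c in cases if c[best] == v]
        sub_l = [l for c, l in zip(cases, labels) if c[best] == v]
        children[v] = id3(sub_c, sub_l, [a for a in attrs if a != best])
    return (best, children)
```

Note how the recursion mirrors the bullets: split on the chosen attribute, drop it from further consideration, and recurse on each subset.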
"Best" attribute
• Entropy (in information theory)
  – A measure of uncertainty
  – Gaining info == lowering entropy
    • A fair coin flip = 1 bit of entropy
    • A loaded coin (always heads) = 0 bits of entropy
      – No uncertainty
    • A fair roll of a d4 = 2 bits of entropy
    • A fair roll of a d8 = 3 bits of entropy
Entropy, cont.
• Given:
  – V: a random variable with values v1…vn
• Entropy:

  H(V) = Σ_{k=1..n} P(vk) · log2(1 / P(vk))

• Where:
  – P(x) is the probability of x.

Entropy, cont.
• Example:
  – We have a loaded 4-sided die
  – Probabilities: {1: 10%, 2: 5%, 3: 25%, 4: 60%}

  H(V) = (0.1·log2(1/0.1)) + (0.05·log2(1/0.05)) + (0.25·log2(1/0.25)) + (0.6·log2(1/0.6))
       = (0.1·log2(10)) + (0.05·log2(20)) + (0.25·log2(4)) + (0.6·log2(1.667))
       ≈ 0.332 + 0.216 + 0.5 + 0.442
       ≈ 1.49

Recall: the entropy of a fair d4 is 2.0, so this die is slightly more predictable.
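The definition and the worked example translate directly into a few lines of Python (a sketch; the function name `entropy` is mine):

```python
import math

def entropy(probs):
    """H(V) = sum_k P(vk) * log2(1 / P(vk)), skipping zero-probability values."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h_loaded = entropy([0.10, 0.05, 0.25, 0.60])  # the loaded d4 above, about 1.49
h_fair = entropy([0.25] * 4)                  # a fair d4, exactly 2.0 bits
```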
Information Gain
• The reduction in entropy
• In the ID3 algorithm:
  – We want to split the training cases based on attribute i
  – Where attribute i gives us the most information
    • i.e., lowers entropy the most
Information Gain, cont.
• Suppose:
  – E is a set of p training cases
  – There are n "results" of each training case: r1…rn
  – We're considering splitting E based on attribute i, which has m possible values: Ai1…Aim
  – Ej is the subset of E which has result j, where 1 <= j <= n
  – size(E) is p; size(Ej) is the size of that subset
  – The resulting tree would have m branches
• The gain is:

  Gain(E, i) = H(E) − H(Ei)

  where H(Ei) = Σ_{j=1..m} (size of branch j / size(E)) · H(branch j), i.e. the weighted average entropy of the m subsets.

• Split on the attribute with the largest gain.
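Sketched in Python, using the Patrons column of the sample input as a check (the 0.541 value agrees with the worked example below; all helper names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """H of a list of output values, computed from their frequencies."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(values, labels):
    """Information gain of a split: H(E) minus the weighted
    entropy of the subsets induced by each attribute value."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        sub = [l for val, l in zip(values, labels) if val == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

# Pat column and outputs for training cases 1-12 from the sample input
pat = ['S', 'F', 'S', 'F', 'F', 'S', 'N', 'S', 'F', 'F', 'N', 'F']
out = ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'Y']
```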
Original Example
• Let's take two potential attributes: Wait & Pat. Which is best to split on?
• Step A1: Calculate H(E)
  – 6 Yes, 6 No

  H(E) = (6/12)·log2(2) + (6/12)·log2(2) = 1.0
Original Example, cont.
• Step A2: Calculate H(Ewait)
  – 4 possible values, so we'd end up with 4 branches:
    • "0-10": {1, 3, 6, 7, 8, 11}; 4 Yes, 2 No
    • "10-30": {4, 10}; 1 Yes, 1 No
    • "30-60": {2, 12}; 1 Yes, 1 No
    • ">60": {5, 9}; 2 No
  – Calculate the entropy of this split group:

  H(0-10) = (4/6)·log2(1/(4/6)) + (2/6)·log2(1/(2/6)) ≈ 0.918
  H(10-30) = H(30-60) = (1/2)·log2(2) + (1/2)·log2(2) = 1.0
  H(>60) = 0.0

  H(Ewait) = (6/12)·H(0-10) + (2/12)·H(10-30) + (2/12)·H(30-60) + (2/12)·H(>60) ≈ 0.79

  Gain(E, wait) = H(E) − H(Ewait) = 1.0 − 0.79 = 0.21
Original Example, cont.
• Step A3: Calculate H(Epat)
  – 3 possible values, so we'd end up with 3 branches:
    • "Some": {1, 3, 6, 8}; 4 Yes
    • "Full": {2, 4, 5, 9, 10, 12}; 2 Yes, 4 No
    • "None": {7, 11}; 2 No
  – Calculate the entropy of this split group:

  H(Some) = H(None) = 0.0
  H(Full) = (2/6)·log2(1/(2/6)) + (4/6)·log2(1/(4/6)) ≈ 0.918

  H(Epat) = (4/12)·H(Some) + (6/12)·H(Full) + (2/12)·H(None) = 0.459

  Gain(E, pat) = H(E) − H(Epat) = 1.0 − 0.459 = 0.541

So… which is better: splitting on Wait, or Pat?
Original Example, cont.
• Pat is much better (0.541 gain vs. 0.21 gain)
• Here is the tree so far:

[Figure: root Patrons; Some → Y leaf (cases {1,3,6,8}); None → N leaf (cases {7,11}); Full → still to be expanded (cases {2,4,5,9,10,12})]

• Now we need a subtree to handle the case where Patrons == Full
  – Note: the training set is smaller now (6 cases vs. 12)
Original Example, cont.
• Look at two alternatives: Alt & Type
• Calculate the entropy of the remaining group:
  – We actually already calculated it (H(Full) in Step A3)
  – That value becomes H(E) for this recursive application of ID3
  – H(E) ≈ 0.918
Original Example, cont.
• Calculate the entropy if we split on Alt
  – Two possible values: "Yes" and "No"
    • "Yes" (Alt): {2, 4, 5, 10, 12}; 2 Yes, 3 No (result)
    • "No" (Alt): {9}; 1 No (result)

  H(Yes) = (2/5)·log2(1/(2/5)) + (3/5)·log2(1/(3/5)) ≈ 0.971
  H(No) = 0.0

  H(Ealt) = (5/6)·H(Yes) + (1/6)·H(No) ≈ 0.809

  Gain(E, alt) ≈ 0.918 − 0.809 ≈ 0.109
Original Example, cont.
• Calculate the entropy if we split on Type
  – 4 possible values: "French", "Thai", "Burger", and "Italian"
    • "French": {5}; 1 No
    • "Thai": {2, 4}; 1 Yes, 1 No
    • "Burger": {9, 12}; 1 Yes, 1 No
    • "Italian": {10}; 1 No

  H(French) = H(Italian) = 0.0
  H(Burger) = H(Thai) = (1/2)·log2(2) + (1/2)·log2(2) = 1.0

  H(Etype) = (1/6)·H(French) + (2/6)·H(Thai) + (2/6)·H(Burger) + (1/6)·H(Italian) ≈ 0.667

  Gain(E, type) ≈ 0.918 − 0.667 ≈ 0.251

Which is better: Alt or Type?
Original Example, cont.
• Type is better (0.251 gain vs. 0.109 gain)
  – Hungry, Price, Reservation, and Est would give you the same gain.
• Here is the tree so far:

[Figure: root Patrons; Some → Y leaf {1,3,6,8}; None → N leaf {7,11}; Full → Type node {2,4,5,9,10,12}; under Type: French → N leaf {5}, Italian → N leaf {10}, Thai → subtree pending {2,4}, Burger → subtree pending {9,12}]

• Recursively make two more subtrees…
Original Example, cont.
• Here's one possibility (skipping the details):

[Figure: root Patrons; Some → Y {1,3,6,8}; None → N {7,11}; Full → Type {2,4,5,9,10,12}; French → N {5}; Italian → N {10}; Thai → Fri node {2,4} (Yes → Y {4}, No → N {2}); Burger → Alt node {9,12} (Yes → Y {12}, No → N {9})]
Using a decision tree
• This algorithm will perfectly match all training cases.
• The hope is that it will generalize to novel cases.
• Let's take a new case (not found in training):
  – Alt="No", Bar="Yes", Fri="No", Pat="Full"
  – Hungry="Yes", Price="$$", Rain="Yes"
  – Reservation="Yes", Type="Italian", Est="30-60"
• Will we wait?
Original Example, cont.
• Here's the decision process:

[Figure: the learned tree again, with the path for the new case highlighted: Patrons="Full" → Type="Italian" → leaf N]

• Alt="No"
• Bar="Yes"
• Fri="No"
• Pat="Full"
• Hungry="Yes"
• Price="$$"
• Rain="Yes"
• Reservation="Yes"
• Type="Italian"
• Est="30-60"

So… no, we won't wait.
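Classifying a new case is just a walk from the root. A sketch, with the learned tree transcribed from the slides as nested (attribute, {value: subtree}) pairs; the representation and `classify` are my own:

```python
# The learned (unpruned) tree; leaves are 'Y' (wait) or 'N' (don't wait)
TREE = ('Pat', {
    'None': 'N',
    'Some': 'Y',
    'Full': ('Type', {
        'French': 'N',
        'Italian': 'N',
        'Thai': ('Fri', {'Yes': 'Y', 'No': 'N'}),
        'Burger': ('Alt', {'Yes': 'Y', 'No': 'N'}),
    }),
})

def classify(tree, case):
    """Follow the branch matching the case's value for each tested attribute."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[case[attr]]
    return tree

# The new case; note only Pat and Type are ever consulted on this path
new_case = {'Alt': 'No', 'Bar': 'Yes', 'Fri': 'No', 'Pat': 'Full',
            'Hungry': 'Yes', 'Price': '$$', 'Rain': 'Yes',
            'Reservation': 'Yes', 'Type': 'Italian', 'Est': '30-60'}
```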
Pruning
• Sometimes an exact fit is not necessary
  – The tree is too big (deep)
  – The tree isn't generalizing well to new cases (overfitting)
  – We don't have a lot of training cases:
    • We would get close to the same results by removing the Attr node and labeling it as a leaf (r1)

[Figure: an Attr node with branches v1, v2, v3 leading to leaves r1 (cases {11, 41}), r2 (case {47}), and r1 (case {98})]
Chi-Squared Test
• The chi-squared test can be used to determine if a decision node is statistically significant.
• Example 1:
  – Is there a strong significance between hair color and eye color?

RAW DATA              Hair Color
                      Light   Dark
Eye Color Brown       32      12
          Green/Blue  14      22
          Other       6       9
Chi-Squared Test
• Example 2:
  – Is there a strong significance between console preference and passing etgg1803?

RAW DATA              Console Preference
                PS3   PC    XBox360   Wii   None
Pass ETGG1803?
          Pass  5     12    6         4     15
          Fail  4     2     5         4     2
Chi-Squared Test
• Steps:
• 1) Calculate row, column, and overall totals

TOTALS                Hair Color
                      Light   Dark
Eye Color Brown       32      12      44
          Green/Blue  14      22      36
          Other       6       9       15
                      52      43      95
Chi-Squared Test
• 2) Calculate the expected value of each cell
  – RowTotal * ColTotal / OverallTotal

EXPECTED              Hair Color
                      Light   Dark
Eye Color Brown       24.08   19.92
          Green/Blue  19.70   16.30
          Other       8.21    6.79

(e.g. Brown/Light: 52*44/95 = 24.08; Green-Blue/Dark: 36*43/95 = 16.30)
Chi-Squared Test
• 3) Calculate χ2 for each cell: (Raw − Expected)^2 / Expected

CHI-SQUARED           Hair Color
                      Light   Dark
Eye Color Brown       2.60    3.15
          Green/Blue  1.65    2.00
          Other       0.60    0.71

(e.g. Brown/Light: (32−24.08)^2/24.08 = 2.60; Green-Blue/Dark: (22−16.3)^2/16.3 = 2.00)

χ2 = 2.60 + 3.15 + 1.65 + 2.00 + 0.60 + 0.71 = 10.71
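Steps 1-3 can be checked in a few lines of Python (a sketch over the raw counts above; variable names are mine):

```python
# Raw counts: rows = eye color (Brown, Green/Blue, Other), cols = (Light, Dark)
raw = [[32, 12],
       [14, 22],
       [6,  9]]

row_totals = [sum(r) for r in raw]        # step 1: 44, 36, 15
col_totals = [sum(c) for c in zip(*raw)]  # step 1: 52, 43
grand = sum(row_totals)                   # step 1: 95

# step 2: expected[i][j] = RowTotal * ColTotal / OverallTotal
expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]

# step 3: chi-squared = sum of (raw - expected)^2 / expected over every cell
chi2 = sum((raw[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(raw)) for j in range(len(raw[0])))
```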
Chi-Squared Test
• 4) Look up your chi-squared value in a table
  – The degrees of freedom (dof) is (numRows − 1) * (numCols − 1)
  – http://home.comcast.net/~sharov/PopEcol/tables/chisq.html
    • If the table entry (usually for 0.05) is less than your chi-squared, it's statistically significant.
  – Or use scipy (www.scipy.org):

    import scipy.stats
    if 1.0 - scipy.stats.chi2.cdf(chiSquared, dof) > 0.05:
        pass  # statistically insignificant
Chi-Squared Test
• We have a χ2 value of 10.71 (dof = 2)
• The table entry for 5% probability (0.05) is 5.99
• 10.71 is bigger than 5.99, so this is statistically significant
• For the console example:
  – χ2 = 8.16
  – dof = 4
  – The table entry for 5% probability is 9.49
  – So… this isn't a statistically significant connection.
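Without scipy, the p-value has a closed form when dof is even, which covers both examples here (dof 2 and 4); a sketch, with the function name my own:

```python
import math

def chi2_sf(x, dof):
    """P(X >= x) for a chi-squared variable with EVEN dof:
    exp(-x/2) * sum_{k=0}^{dof/2 - 1} (x/2)^k / k!"""
    assert dof % 2 == 0, "closed form only valid for even dof"
    term = math.exp(-x / 2.0)
    total = term
    for k in range(1, dof // 2):
        term *= (x / 2.0) / k
        total += term
    return total

p_hair = chi2_sf(10.71, 2)    # hair/eye example: well under 0.05 -> significant
p_console = chi2_sf(8.16, 4)  # console example: above 0.05 -> not significant
```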
Chi-Squared Pruning
• Bottom-up
  – Do a depth-first traversal
  – Do your test after calling the function recursively on your children
Original Example, cont.
• Look at "Burger?" first

[Figure: the full learned tree annotated with (Yes, No) counts: root Patrons [6Y,6N]; Some → Y [4Y,0N]; None → N [0Y,2N]; Full → Type [2Y,4N]; French → N [0Y,1N]; Italian → N [0Y,1N]; Thai → Fri [1Y,1N] (Yes → Y [1Y,0N], No → N [0Y,1N]); Burger → Alt [1Y,1N] (Yes → Y [1Y,0N], No → N [0Y,1N])]
Original Example, cont.
• Do a Chi-squared test:

[Figure: the Burger branch — Burger → Alt [1Y,1N]; Alt=Yes → Y [1Y,0N]; Alt=No → N [0Y,1N]]

Original            Yes   No
  Yes: Wait         1     0
  No: Don't         0     1

Totals              Yes   No
  Yes: Wait         1     0     1
  No: Don't         0     1     1
                    1     1     2

Expected            Yes   No
  Yes: Wait         0.5   0.5
  No: Don't         0.5   0.5

Chi's               Yes   No
  Yes: Wait         0.5   0.5
  No: Don't         0.5   0.5

χ2 = 0.5 + 0.5 + 0.5 + 0.5 = 2.0    dof = (2-1)*(2-1) = 1    Table(0.05, 1) = 3.84

So… prune it!

Note: we'll have a similar case with Thai. So… prune it too!
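The same arithmetic as the hair/eye example, packaged as a generic contingency-table helper (my own, not from the slides) and applied to the Alt node's 2x2 table:

```python
def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over a contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    grand = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / grand) ** 2
               / (rows[i] * cols[j] / grand)
               for i in range(len(rows)) for j in range(len(cols)))

# Alt node: rows = (wait, don't wait), cols = (Alt yes, Alt no)
alt_chi2 = chi_squared([[1, 0],
                        [0, 1]])  # 2.0 < Table(0.05, 1) = 3.84 -> prune
```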
Original Example, cont.
• Here's one possibility:

[Figure: before pruning — Patrons [6Y,6N]; Some → Y [4Y,0N]; None → N [0Y,2N]; Full → Type [2Y,4N]; French → N [0Y,1N]; Italian → N [0Y,1N]; Thai → Fri [1Y,1N]; Burger → Alt [1Y,1N]. After pruning, the Fri and Alt tests are removed, leaving Thai and Burger as leaves with [1Y,1N] each.]
Original Example, cont.

[Figure: the pruned Type node — Type [2Y,4N] with leaves French [0Y,1N], Thai [1Y,1N], Burger [1Y,1N], Italian [0Y,1N]]

I got a chi-squared value of 1.52, dof = 3… prune it!
Original Example, cont.
• Here's one possibility:

[Figure: after pruning the Type test — Patrons [6Y,6N]; Some → Y [4Y,0N]; Full → N [2Y,4N]; None → N [0Y,2N]]
Pruning Example, cont.

[Figure: the final tree — Patrons [6Y,6N]; Some → Y [4Y,0N]; Full → N [2Y,4N]; None → N [0Y,2N]]

I got a chi-squared value of 6.667, dof = 2. So… keep it!

Note: if the evidence were stronger (more training cases) in the Burger and Thai branches, we wouldn't have pruned them.
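The keep decision for the Patrons node can be checked the same way (a sketch; `chi_squared` is a generic contingency-table helper of my own, not from the slides):

```python
def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over a contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    grand = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / grand) ** 2
               / (rows[i] * cols[j] / grand)
               for i in range(len(rows)) for j in range(len(cols)))

# Patrons node: rows = (Yes, No) outputs, cols = (Some, Full, None) branches
patrons_chi2 = chi_squared([[4, 2, 0],
                            [0, 4, 2]])
# 6.667 > Table(0.05, 2) = 5.99 -> statistically significant, keep the node
```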