TDIDT Learning


Page 1: TDIDT Learning

TDIDT Learning

Page 2: TDIDT Learning

Decision Tree

• Internal nodes: tests on some property
• Branches from internal nodes: values of the associated property
• Leaf nodes: classifications
• An individual is classified by traversing the tree from its root to a leaf (a minimal sketch follows)
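For concreteness, here is a minimal Python sketch of this structure and of classification by traversal; the class names and the one-test example tree are illustrative, not from the slides:

```python
class Leaf:
    """A leaf node: a classification."""
    def __init__(self, label):
        self.label = label

class Node:
    """An internal node: a test on some property, one branch per value."""
    def __init__(self, prop, branches):
        self.prop = prop          # property tested here, e.g. "Smoking"
        self.branches = branches  # dict mapping property value -> subtree

def classify(tree, individual):
    """Classify an individual by traversing from the root to a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[individual[tree.prop]]
    return tree.label

# Hypothetical one-test tree:
tree = Node("Smoking", {"Yes": Leaf("High"), "No": Leaf("Low")})
print(classify(tree, {"Smoking": "Yes"}))   # -> High
```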

Page 3: TDIDT Learning

Sample Decision Tree

[Figure: a sample decision tree. The root tests Age (<20, 20-50, >50); depending on the branch, internal nodes test Smoking (Yes/No), Weight (Under, Normal, Over, Obese), and Exercise (Never, Seldom, Regular); each leaf assigns a classification of Low, Moderate, or High.]

Page 4: TDIDT Learning

Decision Tree Learning

• Learning consists of constructing a decision tree that allows the classification of objects.

• Given a set of training instances, a decision tree is said to represent the classifications if it properly classifies all of the training instances (i.e., is consistent).

Page 5: TDIDT Learning

TDIDT

• Function Induce-Tree(Example-set, Properties)
  – If all elements in Example-set are in the same class, then return a leaf node labeled with that class
  – Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
  – Else:
    • Select P from Properties (*)
    • Remove P from Properties
    • Make P the root of the current tree
    • For each value V of P:
      – Create a branch of the current tree labeled by V
      – Partition_V ← elements of Example-set with value V for P
      – Induce-Tree(Partition_V, Properties)
      – Attach the result to branch V
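A runnable Python rendering of this procedure, as a sketch: the dict-based example format (examples are dicts with a 'class' key) and the `select` callback for the step marked (*) are assumptions of this sketch, not part of the slide.

```python
from collections import Counter

def majority_class(example_set):
    """The most common class label in the example set."""
    return Counter(ex["class"] for ex in example_set).most_common(1)[0][0]

def induce_tree(example_set, properties, select):
    """Returns either a class label (leaf) or a (property, {value: subtree})
    pair. `select` is the heuristic marked (*); ID3 plugs in information gain."""
    classes = {ex["class"] for ex in example_set}
    if len(classes) == 1:                          # all in the same class
        return classes.pop()
    if not properties:                             # no properties left
        return majority_class(example_set)
    p = select(example_set, properties)            # (*) select P
    remaining = [q for q in properties if q != p]  # remove P
    branches = {}
    for v in {ex[p] for ex in example_set}:        # each (observed) value V of P
        partition_v = [ex for ex in example_set if ex[p] == v]
        branches[v] = induce_tree(partition_v, remaining, select)
    return (p, branches)                           # P is the root of this subtree
```

Iterating only over observed values of P keeps every Partition_V nonempty, which the slide's pseudocode implicitly assumes.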

Page 6: TDIDT Learning

Illustrative Training Set: Risk Assessment for Loan Applications

Client #   Credit History   Debt Level   Collateral   Income Level   RISK LEVEL
1          Bad              High         None         Low            HIGH
2          Unknown          High         None         Medium         HIGH
3          Unknown          Low          None         Medium         MODERATE
4          Unknown          Low          None         Low            HIGH
5          Unknown          Low          None         High           LOW
6          Unknown          Low          Adequate     High           LOW
7          Bad              Low          None         Low            HIGH
8          Bad              Low          Adequate     High           MODERATE
9          Good             Low          None         High           LOW
10         Good             High         Adequate     High           LOW
11         Good             High         None         Low            HIGH
12         Good             High         None         Medium         MODERATE
13         Good             High         None         High           LOW
14         Bad              High         None         Medium         HIGH

Page 7: TDIDT Learning

ID3 Example (I)

1) Choose Income as the root of the tree.

   Income?
     Low:    examples 1, 4, 7, 11          (step 2)
     Medium: examples 2, 3, 12, 14         (step 3)
     High:   examples 5, 6, 8, 9, 10, 13   (step 4)

2) All examples on the Low branch are in the same class, HIGH. Return a leaf node.

   Income?
     Low:    HIGH
     Medium: examples 2, 3, 12, 14
     High:   examples 5, 6, 8, 9, 10, 13

3) For the Medium branch, choose Debt Level as the root of the subtree.

   Debt Level?
     High: examples 2, 12, 14   (step 3b)
     Low:  example 3            (step 3a)

3a) All examples are in the same class, MODERATE. Return a leaf node.

   Debt Level?
     High: examples 2, 12, 14   (step 3b)
     Low:  MODERATE

Page 8: TDIDT Learning

ID3 Example (II)

3b) Choose Credit History as the root of the subtree.

   Credit History?
     Bad:     example 14
     Unknown: example 2
     Good:    example 12

3b′-3b′′′) All examples are in the same class. Return leaf nodes.

   Credit History?
     Bad:     HIGH
     Unknown: HIGH
     Good:    MODERATE

4) For the High branch, choose Credit History as the root of the subtree.

   Credit History?
     Bad:     example 8
     Unknown: examples 5, 6
     Good:    examples 9, 10, 13

4a-4c) All examples are in the same class. Return leaf nodes.

   Credit History?
     Bad:     MODERATE
     Unknown: LOW
     Good:    LOW

Page 9: TDIDT Learning

ID3 Example (III)

Attach the subtrees at the appropriate places.

   Income?
     Low:    HIGH
     Medium: Debt Level?
               High: Credit History?
                       Bad:     HIGH
                       Unknown: HIGH
                       Good:    MODERATE
               Low:  MODERATE
     High:   Credit History?
               Bad:     MODERATE
               Unknown: LOW
               Good:    LOW

Page 10: TDIDT Learning

Non-Uniqueness

• Decision trees are not unique:
  – Given a set of training instances, there generally exist several decision trees that represent the classifications
• The learning problem states that we should seek not only consistency but also generalization. So, …

Page 11: TDIDT Learning

TDIDT’s Question

Given a training set, which of all the decision trees consistent with that training set has the greatest likelihood of correctly classifying unseen instances of the population?

Page 12: TDIDT Learning

ID3’s (Approximate) Bias

• ID3 (and family) prefers the simplest decision tree that is consistent with the training set.

• Occam’s Razor Principle:
  – “It is vain to do with more what can be done with less... Entities should not be multiplied beyond necessity.”
  – i.e., always accept the simplest answer that fits the data / avoid unnecessary constraints.

Page 13: TDIDT Learning

ID3’s Property Selection

• Each property of an instance may be thought of as contributing a certain amount of information to its classification.
  – For example, when determining the shape of an object, the number of sides contributes a certain amount of information to the goal; color contributes a different amount.

• ID3 measures the information gained by making each property the root of the current subtree and subsequently chooses the property that produces the greatest information gain.

Page 14: TDIDT Learning

Discussion (I)

• In terms of learning as search, ID3 works as follows:
  – Search space = the set of all possible decision trees
  – Operations = adding tests to a tree
  – A form of hill climbing: ID3 adds a subtree to the current tree and continues its search (no backtracking, so it is subject to local optima)

• It follows that ID3 is very efficient, but its performance depends on the criteria for selecting properties to test (and their form)

Page 15: TDIDT Learning

Discussion (II)

• ID3 handles only discrete attributes. Extensions to numerical attributes have been proposed, the most famous being C4.5 (and its commercial successor, C5.0)

• Experience shows that TDIDT learners tend to produce very good results on many problems

• Trees are most attractive when end users want interpretable knowledge from their data

Page 16: TDIDT Learning

Entropy (I)

• Let S be a set of examples drawn from c classes

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the proportion of examples of $S$ belonging to class $i$. (Note: we define $0 \log 0 = 0$.)
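As a quick check, a small illustrative Python helper that evaluates this formula; the class counts 6/3/5 come from the loan-risk table on page 6 (6 HIGH, 3 MODERATE, 5 LOW):

```python
from math import log2

def entropy(class_counts):
    """-sum_i p_i log2 p_i over nonzero counts (0 log 0 taken as 0)."""
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total) for n in class_counts if n > 0)

print(entropy([6, 3, 5]))   # -> about 1.531 bits
```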

Page 17: TDIDT Learning

Entropy (II)

• Intuitively, the smaller the entropy, the purer the partition

• Based on Shannon’s information theory (c=2):
  – If p1=1 (resp. p2=1), then the receiver knows the example is positive (resp. negative). No message need be sent.
  – If p1=p2=0.5, then the receiver needs to be told the class of the example. A 1-bit message must be sent.
  – If 0<p1<1 with p1≠0.5, then the receiver needs less than 1 bit on average to know the class of the example.

Page 18: TDIDT Learning

Information Gain

• Let p be a property with n outcomes
• The information gained by partitioning a set S according to p is:

$$\mathrm{Gain}(S, p) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, \mathrm{Entropy}(S_i)$$

where $S_i$ is the subset of $S$ for which property $p$ has its $i$-th value.
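A sketch that applies these two formulas to the loan-risk training set from page 6; by this computation Income Level has the largest gain (~0.97 bits), which matches its selection as the root in the ID3 example on pages 7-9:

```python
from collections import Counter, defaultdict
from math import log2

# Loan-risk training set from page 6 (Credit History, Debt Level,
# Collateral, Income Level, RISK LEVEL), in client order 1..14.
DATA = [
    ("Bad",     "High", "None",     "Low",    "HIGH"),
    ("Unknown", "High", "None",     "Medium", "HIGH"),
    ("Unknown", "Low",  "None",     "Medium", "MODERATE"),
    ("Unknown", "Low",  "None",     "Low",    "HIGH"),
    ("Unknown", "Low",  "None",     "High",   "LOW"),
    ("Unknown", "Low",  "Adequate", "High",   "LOW"),
    ("Bad",     "Low",  "None",     "Low",    "HIGH"),
    ("Bad",     "Low",  "Adequate", "High",   "MODERATE"),
    ("Good",    "Low",  "None",     "High",   "LOW"),
    ("Good",    "High", "Adequate", "High",   "LOW"),
    ("Good",    "High", "None",     "Low",    "HIGH"),
    ("Good",    "High", "None",     "Medium", "MODERATE"),
    ("Good",    "High", "None",     "High",   "LOW"),
    ("Bad",     "High", "None",     "Medium", "HIGH"),
]

def entropy(examples):
    """Entropy of the class (last column) distribution of `examples`."""
    counts = Counter(ex[-1] for ex in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain(examples, col):
    """Information gain of splitting `examples` on column `col`."""
    partitions = defaultdict(list)
    for ex in examples:
        partitions[ex[col]].append(ex)
    remainder = sum(len(s) / len(examples) * entropy(s)
                    for s in partitions.values())
    return entropy(examples) - remainder

for name, col in [("Credit History", 0), ("Debt Level", 1),
                  ("Collateral", 2), ("Income Level", 3)]:
    print(f"{name}: {gain(DATA, col):.3f}")
# Income Level comes out largest (~0.967), so ID3 makes it the root.
```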

Page 19: TDIDT Learning

Play Tennis

OUTLOOK    TEMPERATURE   HUMIDITY   WIND     PLAY TENNIS
Overcast   Hot           High       Weak     Yes
Overcast   Hot           Normal     Weak     Yes
Sunny      Hot           High       Weak     No
Sunny      Mild          Normal     Strong   Yes
Rain       Cool          Normal     Strong   No
Sunny      Mild          High       Weak     No

What is the ID3 induced tree?

Page 20: TDIDT Learning

ID3’s Splitting Criterion

• The objective of ID3 at each split is to increase information gain or, equivalently, to lower entropy. It does so as much as possible.
  – Pros: Easy to do
  – Cons: May lead to overfitting

Page 21: TDIDT Learning

Overfitting

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances.
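Restated compactly (a standard formalization, not from the slide; error_train denotes error over the training examples and error_D error over the entire distribution D):

$$h \text{ overfits} \iff \exists\, h' \in H :\; \mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h') \;\wedge\; \mathrm{error}_{D}(h') < \mathrm{error}_{D}(h)$$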

Page 22: TDIDT Learning

Avoiding Overfitting

• Two alternatives:
  – Stop growing the tree before it begins to overfit (e.g., when a data split is not statistically significant)
  – Grow the tree to full (overfitting) size and post-prune it

• Either way, when do I stop? What is the correct final tree size?

Page 23: TDIDT Learning

Approaches

• Use only training data and a statistical test to estimate whether expanding/pruning is likely to produce an improvement beyond the training set

• Use MDL (minimum description length) to minimize size(tree) + size(misclassifications(tree))

• Use a separate validation set to evaluate utility of pruning

• Use richer node conditions and accuracy

Page 24: TDIDT Learning

Reduced Error Pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• While the accuracy on the validation set increases:
  – Evaluate the impact of pruning each subtree, replacing its root by a leaf labeled with the majority class for that subtree
  – Remove the subtree that most increases validation set accuracy (greedy approach)
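A sketch of this loop, reusing the (property, {value: subtree}) tree representation assumed in the earlier TDIDT sketch; helper names are illustrative, and the traversal assumes every attribute value seen in the validation set has a branch in the tree:

```python
from collections import Counter

def classify(tree, ex):
    """Traverse a (property, {value: subtree}) tree; leaves are class labels."""
    while isinstance(tree, tuple):
        prop, branches = tree
        tree = branches[ex[prop]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["class"] for ex in examples) / len(examples)

def majority_class(examples):
    return Counter(ex["class"] for ex in examples).most_common(1)[0][0]

def candidate_prunings(tree, examples):
    """Every tree obtained by replacing exactly one internal node with a leaf
    labeled with the majority class of the (training) examples reaching it."""
    if not isinstance(tree, tuple):
        return []
    prop, branches = tree
    candidates = [majority_class(examples)]               # prune this node itself
    for v, sub in branches.items():
        reaching = [ex for ex in examples if ex[prop] == v]
        if not reaching:
            continue
        for pruned in candidate_prunings(sub, reaching):
            candidates.append((prop, {**branches, v: pruned}))
    return candidates

def reduced_error_prune(tree, train, validation):
    improved = True
    while improved:                                       # while accuracy increases
        improved, best = False, accuracy(tree, validation)
        for cand in candidate_prunings(tree, train):
            acc = accuracy(cand, validation)
            if acc > best:                                # greedy: keep the best pruning
                tree, best, improved = cand, acc, True
    return tree
```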

Page 25: TDIDT Learning

Rule Post-pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• Convert the tree into an equivalent set of rules
• For each rule:
  – Remove any preconditions whose removal increases the rule’s accuracy on the validation set
• Sort the rules by estimated accuracy
• Classify new examples using the new ordered set of rules
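The tree-to-rules step can be sketched as follows (illustrative representation, using the same (property, {value: subtree}) trees as the earlier sketches): each root-to-leaf path yields one rule whose preconditions are the tests along the path.

```python
def tree_to_rules(tree, preconditions=()):
    """Return a list of (preconditions, class) pairs; each precondition
    is a (property, value) test along the path from the root."""
    if not isinstance(tree, tuple):                # leaf: one finished rule
        return [(list(preconditions), tree)]
    prop, branches = tree
    rules = []
    for v, sub in branches.items():
        rules += tree_to_rules(sub, (*preconditions, (prop, v)))
    return rules

# Each rule can then be pruned independently: drop any (property, value)
# precondition whose removal increases accuracy on the validation set.
```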

Page 26: TDIDT Learning

Discussion

• Reduced-error pruning produces the smallest version of the most accurate subtree

• Rule post-pruning is more fine-grained and possibly the most used method

• In all cases, pruning based on a validation set is problematic when the amount of available data is limited

Page 27: TDIDT Learning

Accuracy vs Entropy

• ID3 uses entropy to build the tree and accuracy to prune it

• Why not use accuracy in the first place?– How?– How does it compare with entropy?

• Is there a way to make it work?

Page 28: TDIDT Learning

Other Issues

• The text briefly discusses the following aspects of decision tree learning:
  – Continuous-valued attributes
  – Alternative splitting criteria (e.g., for attributes with many values)
  – Accounting for costs

Page 29: TDIDT Learning

Unknown Attribute Values

• Alternatives:
  – Remove examples with missing attribute values
  – Treat a missing value as a distinct, special value of the attribute
  – Replace a missing value with the most common value of the attribute (sketched after this list):
    • Overall
    • At node n
    • At node n, among examples with the same class label
  – Use probabilities
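A sketch of the most-common-value alternatives (overall, at node n, or at node n restricted to the same class label); the dict-based example format is an assumption carried over from the earlier sketches:

```python
from collections import Counter

def most_common_value(examples, attr, cls=None):
    """Mode of `attr` over examples with a known value; if `cls` is given,
    restrict to training examples with that class label."""
    values = [ex[attr] for ex in examples
              if ex[attr] is not None and (cls is None or ex["class"] == cls)]
    return Counter(values).most_common(1)[0][0]

# Filling strategies for a training example `ex` missing attribute `a`:
#   overall:             most_common_value(all_examples, a)
#   at node n:           most_common_value(examples_at_node, a)
#   at node n, by class: most_common_value(examples_at_node, a, cls=ex["class"])
```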