Decision-Tree Induction & Decision-Rule Induction


Transcript of Decision-Tree Induction & Decision-Rule Induction

Page 1: Decision-Tree Induction & Decision-Rule Induction

Decision-Tree Induction & Decision-Rule Induction

Evgueni Smirnov

Page 2: Decision-Tree Induction & Decision-Rule Induction

Overview

1. Instances, Classes, Languages, Hypothesis Spaces

2. Decision Trees

3. Decision Rules

4. Evaluation Techniques

5. Intro to Weka

Page 3: Decision-Tree Induction & Decision-Rule Induction

Instances and Classes

friendly robots

A class is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure or function.

Example. The set: {children, photos, cat, diplomas} can be viewed as a class “Most important things to take out of your apartment when it catches fire”.

Page 4: Decision-Tree Induction & Decision-Rule Induction

Instances, Classes, Languages

head = square, body = round, smiling = yes, holding = flag, color = yellow

[Figure: a robot instance described in the instance language I; the class to learn is "friendly robots"]

Page 5: Decision-Tree Induction & Decision-Rule Induction

Instances, Classes, Hypothesis Spaces

head = square, body = round, smiling = yes, holding = flag, color = yellow

[Figure: the instance language Li, the hypothesis space H, and the mapping M between them; the hypothesis "smiling = yes" represents the class of friendly robots]

Page 6: Decision-Tree Induction & Decision-Rule Induction

The Classification Task

[Figure: given the instance language I, sets of positive and negative training instances I+ and I-, the hypothesis space H, and the mapping M, find a hypothesis in H that correctly classifies the instances]

Page 7: Decision-Tree Induction & Decision-Rule Induction

Decision Trees for Classification

• Decision trees

• Appropriate problems for decision trees

• Entropy and Information Gain

• The ID3 algorithm

• Avoiding Overfitting via Pruning

• Handling Continuous-Valued Attributes

• Handling Missing Attribute Values

Page 8: Decision-Tree Induction & Decision-Rule Induction

Decision Trees

Definition: A decision tree is a tree such that:

• Each internal node tests an attribute

• Each branch corresponds to attribute value

• Each leaf node assigns a classification

Outlook
   Sunny → Humidity
      High → no
      Normal → yes
   Overcast → yes
   Rainy → Windy
      False → yes
      True → no

Page 9: Decision-Tree Induction & Decision-Rule Induction

Data Set for Playing Tennis

Outlook   Temp. Humidity Windy Play
Sunny     Hot   High     False no
Sunny     Hot   High     True  no
Overcast  Hot   High     False yes
Rainy     Mild  High     False yes
Rainy     Cool  Normal   False yes
Rainy     Cool  Normal   True  no
Overcast  Cool  Normal   True  yes
Sunny     Mild  High     False no
Sunny     Cool  Normal   False yes
Rainy     Mild  Normal   False yes
Sunny     Mild  Normal   True  yes
Overcast  Mild  High     True  yes
Overcast  Hot   Normal   False yes
Rainy     Mild  High     True  no

Page 10: Decision-Tree Induction & Decision-Rule Induction

Decision Tree For Playing Tennis

Outlook
   Sunny → Humidity
      High → no
      Normal → yes
   Overcast → yes
   Rainy → Windy
      False → yes
      True → no

Page 11: Decision-Tree Induction & Decision-Rule Induction

When to Consider Decision Trees

• Each instance is described by attributes with discrete values (e.g., Outlook = Sunny).

• The classification is over discrete values (e.g., yes/no).

• It is okay to have disjunctive descriptions – each path in the tree is a conjunction of attribute tests, and the tree as a whole represents a disjunction of such conjunctions. Any Boolean function can be represented!

• It is okay for the training data to contain errors – decision trees are robust to classification errors in the training data.

• It is okay for the training data to contain missing values – decision trees can be used even if instances have missing attribute values.

Page 12: Decision-Tree Induction & Decision-Rule Induction

Rules in Decision Trees

If Outlook = Sunny & Humidity = High then Play = no

If Outlook = Sunny & Humidity = Normal then Play = yes

If Outlook = Overcast then Play = yes

If Outlook = Rainy & Windy = False then Play = yes

If Outlook = Rainy & Windy = True then Play = no

Outlook
   Sunny → Humidity
      High → no
      Normal → yes
   Overcast → yes
   Rainy → Windy
      False → yes
      True → no

Page 13: Decision-Tree Induction & Decision-Rule Induction

Decision Tree Induction

Basic Algorithm:

1. Choose A, the “best” decision attribute, for the node N.

2. Assign A as decision attribute for the node N.

3. For each value of A, create new descendant of the node N.

4. Sort training examples to leaf nodes.

5. IF training examples perfectly classified, THEN STOP.

ELSE iterate over new leaf nodes

Page 14: Decision-Tree Induction & Decision-Rule Induction

Decision Tree Induction

Splitting the data on Outlook:

Outlook = Sunny:
Outlook  Temp  Hum     Wind    Play
Sunny    Hot   High    Weak    no
Sunny    Hot   High    Strong  no
Sunny    Mild  High    Weak    no
Sunny    Cool  Normal  Weak    yes
Sunny    Mild  Normal  Strong  yes

Outlook = Overcast:
Outlook   Temp  Hum     Wind    Play
Overcast  Hot   High    Weak    yes
Overcast  Cool  Normal  Strong  yes

Outlook = Rain:
Outlook  Temp  Hum     Wind    Play
Rain     Mild  High    Weak    yes
Rain     Cool  Normal  Weak    yes
Rain     Cool  Normal  Strong  no
Rain     Mild  Normal  Weak    yes
Rain     Mild  High    Strong  no

Page 15: Decision-Tree Induction & Decision-Rule Induction

Entropy

Let S be a sample of training examples, and

p+ is the proportion of positive examples in S and

p- is the proportion of negative examples in S.

Then the entropy, which measures the impurity of S, is:

E(S) = - p+ log2(p+) - p- log2(p-)
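As a quick illustration of this formula, here is a minimal Python sketch (my own, not from the slides) that computes the entropy of a labelled sample; the function name entropy and the list-of-labels representation are assumptions.

import math

def entropy(labels):
    """Entropy of a sample, given its list of class labels (e.g. 'yes'/'no')."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

# 9 positive and 5 negative examples, as in the playing-tennis data set:
print(entropy(['yes'] * 9 + ['no'] * 5))  # ~0.94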

Page 16: Decision-Tree Induction & Decision-Rule Induction

Entropy Example from the Dataset

p_yes = 9/14,  p_no = 5/14

- (9/14) log2(9/14) = 0.41
- (5/14) log2(5/14) = 0.53

E(S) = 0.41 + 0.53 = 0.94

Outlook   Temp. Humidity Windy Play
Sunny     Hot   High     False no
Sunny     Hot   High     True  no
Overcast  Hot   High     False yes
Rainy     Mild  High     False yes
Rainy     Cool  Normal   False yes
Rainy     Cool  Normal   True  no
Overcast  Cool  Normal   True  yes
Sunny     Mild  High     False no
Sunny     Cool  Normal   False yes
Rainy     Mild  Normal   False yes
Sunny     Mild  Normal   True  yes
Overcast  Mild  High     True  yes
Overcast  Hot   Normal   False yes
Rainy     Mild  High     True  no

Page 17: Decision-Tree Induction & Decision-Rule Induction

Information Gain

Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute:

Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} ( |Sv| / |S| ) E(Sv)

where Sv = { s ∈ S | A(s) = v }. For example, Sv1 = { s ∈ S | A(s) = v1 } and Sv2 = { s ∈ S | A(s) = v2 }.
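The following short Python sketch (my own illustration, reusing the entropy helper from the sketch above) computes Gain(S, A) for a data set represented as a list of dictionaries; the column names used here are assumptions.

def information_gain(examples, attribute, target='Play'):
    """Gain(S, A): entropy of S minus the weighted entropy of the subsets Sv."""
    total = entropy([e[target] for e in examples])
    n = len(examples)
    for v in set(e[attribute] for e in examples):
        sv = [e for e in examples if e[attribute] == v]
        total -= len(sv) / n * entropy([e[target] for e in sv])
    return total

# Example: on the full tennis data set, splitting on Outlook gives a gain of about 0.25.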

Page 18: Decision-Tree Induction & Decision-Rule Induction

Example

Splitting the data on Outlook:

Outlook = Sunny:
Outlook  Temp  Hum     Windy  Play
Sunny    Hot   High    False  No
Sunny    Hot   High    True   No
Sunny    Mild  High    False  No
Sunny    Cool  Normal  False  Yes
Sunny    Mild  Normal  True   Yes

Outlook = Overcast:
Outlook   Temp  Hum     Windy   Play
Overcast  Hot   High    Weak    Yes
Overcast  Cool  Normal  Strong  Yes

Outlook = Rain:
Outlook  Temp  Hum     Windy  Play
Rain     Mild  High    False  Yes
Rain     Cool  Normal  False  Yes
Rain     Cool  Normal  True   No
Rain     Mild  Normal  False  Yes
Rain     Mild  High    True   No

 Which attribute should be tested here?

Gain(S_Sunny, Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970

Gain(S_Sunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570

Gain(S_Sunny, Wind) = .970 - (2/5) 1.0 - (3/5) .918 = .019

Page 19: Decision-Tree Induction & Decision-Rule Induction

ID3 Algorithm

Informally:
– Determine the attribute with the highest information gain on the training set.
– Use this attribute as the root, and create a branch for each of the values the attribute can have.
– For each branch, repeat the process with the subset of the training set that is classified by that branch.
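A compact recursive Python sketch of this informal procedure (my own illustration, not the slides' code; it builds on the entropy and information_gain helpers above and represents a tree as nested dictionaries):

from collections import Counter

def id3(examples, attributes, target='Play'):
    """Return a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # all examples have the same class
        return labels[0]
    if not attributes:                 # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)
    return tree

# e.g. tree = id3(examples, ['Outlook', 'Temp.', 'Humidity', 'Windy'])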

Page 20: Decision-Tree Induction & Decision-Rule Induction

Hypothesis Space Search in ID3

• The hypothesis space is the set of all decision trees defined over the given set of attributes.

• ID3’s hypothesis space is a complete space; i.e., the target description is there!

• ID3 performs a simple-to-complex, hill climbing search through this space.

Page 21: Decision-Tree Induction & Decision-Rule Induction

Hypothesis Space Search in ID3

• The evaluation function is the information gain.

• ID3 maintains only a single current decision tree.

• ID3 performs no backtracking in its search.

• ID3 uses all training instances at each step of the search.

Page 22: Decision-Tree Induction & Decision-Rule Induction

Posterior Class Probabilities

Outlook
   Sunny: 2 pos and 3 neg → Ppos = 0.4, Pneg = 0.6
   Overcast: 2 pos and 0 neg → Ppos = 1.0, Pneg = 0.0
   Rainy → Windy
      False: 3 pos and 0 neg → Ppos = 1.0, Pneg = 0.0
      True: 0 pos and 2 neg → Ppos = 0.0, Pneg = 1.0

Page 23: Decision-Tree Induction & Decision-Rule Induction

Overfitting

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h’ ∈ H such that h has smaller error than h’ over the training instances, but h’ has a smaller error than h over the entire distribution of instances.

Page 24: Decision-Tree Induction & Decision-Rule Induction

Reasons for Overfitting

• Noisy training instances. Consider a noisy training example:
  Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No

  This instance is sorted to the same leaf (Outlook = Sunny, Humidity = Normal) as the training instances:
  Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
  Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes

Outlook
   sunny → Humidity
      high → no
      normal → yes
   overcast → yes
   rainy → Windy
      false → yes
      true → no

Page 25: Decision-Tree Induction & Decision-Rule Induction

Reasons for Overfitting

To fit the noisy instance, the tree is grown further below the (Outlook = sunny, Humidity = normal) branch:

Outlook
   sunny → Humidity
      high → no
      normal → Windy
         false → yes
         true → Temp
            hot → no
            mild → yes
            cool → ?
   overcast → yes
   rainy → Windy
      false → yes
      true → no

Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes

Page 26: Decision-Tree Induction & Decision-Rule Induction

Reasons for Overfitting

• A small number of instances is associated with some leaf nodes. In this case it is possible for coincidental regularities to occur that are unrelated to the actual target concept.

[Figure: positive and negative examples in the instance space; around leaves that cover only a few instances there is an area with probably wrong predictions]

Page 27: Decision-Tree Induction & Decision-Rule Induction

Approaches to Avoiding Overfitting

• Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data

• Post-pruning: Allow the tree to overfit the data, and then post-prune the tree.

Page 28: Decision-Tree Induction & Decision-Rule Induction

Pre-pruning

• It is difficult to decide when to stop growing the tree.

• A possible scenario is to stop when a leaf node gets fewer than m training instances. Here is an example for m = 5.

Fully grown tree:

Outlook
   Sunny → Humidity
      High → no
      Normal → yes
   Overcast → yes
   Rainy → Windy
      False → yes
      True → no

Pre-pruned tree (the nodes below Outlook would receive only 2 or 3 training instances each, so growing stops at Outlook):

Outlook
   Sunny → no
   Overcast → yes
   Rainy → ?

Page 29: Decision-Tree Induction & Decision-Rule Induction

Validation Set

• A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the training set.

• Usually before pruning the training data is split randomly into a growing set and a validation set.

Page 30: Decision-Tree Induction & Decision-Rule Induction

Reduced-Error Pruning

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.

Outlook
   sunny → Humidity
      high → no   (3 instances)
      normal → yes   (2 instances)
   overcast → yes
   rainy → Windy
      false → yes
      true → no

Accuracy of the tree on the validation set is 90%.

Page 31: Decision-Tree Induction & Decision-Rule Induction

Reduced-Error Pruning

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.

Outlook
   sunny → no
   overcast → yes
   rainy → Windy
      false → yes
      true → no

Accuracy of the tree on the validation set is 92.4%.

Page 32: Decision-Tree Induction & Decision-Rule Induction

Reduced-Error Pruning

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation set accuracy.

Outlook
   sunny → no
   overcast → yes
   rainy → Windy
      false → yes
      true → no

Accuracy of the tree on the validation set is 92.4%.
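The greedy pruning loop above can be sketched in Python. This is my own illustration, not code from the slides; it assumes the nested-dict tree representation of the earlier ID3 sketch, with examples given as dictionaries.

import copy
from collections import Counter

def classify(tree, instance):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        tree = branches.get(instance[attr])   # unknown value -> None (counted as an error)
    return tree

def accuracy(tree, examples, target='Play'):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def prunable_nodes(tree, path=()):
    """Yield the path ((attr, value), ...) leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        (attr, branches), = tree.items()
        for value, subtree in branches.items():
            yield from prunable_nodes(subtree, path + ((attr, value),))

def pruned(tree, path, examples, target='Play'):
    """Copy of `tree` with the node at `path` replaced by a majority-class leaf."""
    majority = lambda exs: Counter(e[target] for e in exs).most_common(1)[0][0]
    if not path:                       # pruning the root collapses the whole tree
        return majority(examples)
    new_tree = copy.deepcopy(tree)
    node, covered = new_tree, examples
    for attr, value in path[:-1]:
        covered = [e for e in covered if e[attr] == value]
        node = node[attr][value]
    attr, value = path[-1]
    covered = [e for e in covered if e[attr] == value]
    node[attr][value] = majority(covered)
    return new_tree

def reduced_error_prune(tree, growing_set, validation_set, target='Play'):
    """Greedily prune nodes while validation-set accuracy does not decrease."""
    while True:
        best_tree, best_acc = tree, accuracy(tree, validation_set, target)
        for path in prunable_nodes(tree):
            candidate = pruned(tree, path, growing_set, target)
            acc = accuracy(candidate, validation_set, target)
            if acc >= best_acc:
                best_tree, best_acc = candidate, acc
        if best_tree is tree:
            return tree
        tree = best_tree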

Page 33: Decision-Tree Induction & Decision-Rule Induction

Reduced Error Pruning Example

Page 34: Decision-Tree Induction & Decision-Rule Induction

Rule Post-Pruning

IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
……….

1. Convert the tree to an equivalent set of rules (see the sketch below the tree).

2. Prune each rule independently of the others.

3. Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

Outlook
   sunny → Humidity
      high → no
      normal → yes
   overcast → yes
   rainy → Windy
      false → yes
      true → no
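As mentioned in step 1 above, converting a tree to rules can be sketched in a few lines of Python (my own illustration, using the nested-dict tree representation from the earlier ID3 sketch):

def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):            # leaf: emit one rule
        return [(conditions, tree)]
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# Each rule reads: IF attr1 = v1 & attr2 = v2 & ... THEN class.
# On the tennis tree above this yields the five rules listed on Page 12.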

Page 35: Decision-Tree Induction & Decision-Rule Induction

Continuous Valued Attributes

1. Create a set of discrete attributes that test the continuous attribute against candidate thresholds.

2. Apply Information Gain in order to choose the best attribute.

Temperature: 40  48  60  72  80  90
PlayTennis:  No  No  Yes Yes Yes No

Candidate thresholds (midpoints where the class changes): Temp > 54, Temp > 85
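A small Python sketch of step 1 (my own, not from the slides): candidate thresholds are placed at the midpoints between adjacent sorted values whose class labels differ.

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

temps = [40, 48, 60, 72, 80, 90]
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
print(candidate_thresholds(temps, play))  # [54.0, 85.0]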

Page 36: Decision-Tree Induction & Decision-Rule Induction

Missing Attribute Values

Strategies:

1. Assign the most common value of A among the other instances belonging to the same class (concept).

2. If node n tests the attribute A, assign most common value of A among other instances sorted to node n.

3. If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated from the observed frequencies of the values of A among the instances sorted to node n, and the resulting fractional instances are used in the information gain measure, i.e. in the weighted term

Σ_{v ∈ Values(A)} ( |Sv| / |S| ) E(Sv)
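Strategies 1 and 2 can be sketched in Python as follows (my own illustration; None marks a missing value, and the column names are assumptions):

from collections import Counter

def most_common_value(examples, attribute):
    """Most common observed value of `attribute` (None marks a missing value)."""
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    return Counter(observed).most_common(1)[0][0]

def fill_missing(instance, attribute, examples, target=None):
    """Strategy 2: use the instances sorted to the current node (`examples`).
    Strategy 1: additionally restrict them to the instance's own class (`target`)."""
    if instance[attribute] is not None:
        return instance[attribute]
    if target is not None:
        examples = [e for e in examples if e[target] == instance[target]]
    return most_common_value(examples, attribute)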

Page 37: Decision-Tree Induction & Decision-Rule Induction

Summary Points

1. Decision tree learning provides a practical method for concept learning.

2. ID3-like algorithms search a complete hypothesis space.

3. The inductive bias of decision trees is a preference (search) bias.

4. Overfitting the training data is an important issue in decision tree learning.

5. A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.

Page 38: Decision-Tree Induction & Decision-Rule Induction

Learning Decision Rules

• Decision Rules
• Basic Sequential Covering Algorithm
• Learn-One-Rule Procedure
• Pruning

Page 39: Decision-Tree Induction & Decision-Rule Induction

Definition of Decision Rules

Example: If you run the Prism algorithm from Weka on the weather data you will get the following set of decision rules:

if outlook = overcast then PlayTennis = yes

if humidity = normal and windy = FALSE then PlayTennis = yes

if temperature = mild and humidity = normal then PlayTennis = yes

if outlook = rainy and windy = FALSE then PlayTennis = yes

if outlook = sunny and humidity = high then PlayTennis = no

if outlook = rainy and windy = TRUE then PlayTennis = no

Definition: Decision rules are rules with the following form:

if <conditions> then class C.

Page 40: Decision-Tree Induction & Decision-Rule Induction

Why Decision Rules?

• Decision rules are more compact.
• Decision rules are more understandable.

Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:

if X=1 and Y=1 then 1

if Z=1 and W=1 then 1

Otherwise 0;

The equivalent decision tree is much larger:

X
   1 → Y
      1 → 1
      0 → Z
         1 → W
            1 → 1
            0 → 0
         0 → 0
   0 → Z
      1 → W
         1 → 1
         0 → 0
      0 → 0

Page 41: Decision-Tree Induction & Decision-Rule Induction

Why Decision Rules?

Decision boundaries of decision trees

[Figure: positive and negative examples partitioned by a decision tree]

Decision boundaries of decision rules

[Figure: the same examples covered by decision rules]

Page 42: Decision-Tree Induction & Decision-Rule Induction

How to Learn Decision Rules?

1. We can convert trees to rules.
2. We can use specific rule-learning methods.

Page 43: Decision-Tree Induction & Decision-Rule Induction

Sequential Covering Algorithms

function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := Examples \ {examples covered by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  sort LearnedRules according to performance
  return LearnedRules
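The pseudocode above maps almost line by line onto Python. This sketch is my own and assumes that learn_one_rule(...), performance(rule, examples) and covers(rule, example) are provided, e.g. by one of the Learn-One-Rule strategies discussed next.

def learn_rule_set(target, attrs, examples, threshold,
                   learn_one_rule, performance, covers):
    """Sequential covering: learn rules one at a time, removing covered examples."""
    all_examples = list(examples)
    learned_rules = []
    rule = learn_one_rule(target, attrs, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        examples = [e for e in examples if not covers(rule, e)]
        rule = learn_one_rule(target, attrs, examples)
    # sort the final rule set by performance (here: measured on the full training set)
    learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
    return learned_rules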

Page 44: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos

[Figure: positive and negative examples; the maximally general rule "IF true THEN pos" covers all of them]

Page 45: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos
IF A THEN pos

[Figure: specializing the rule with condition A restricts it to a region containing mostly positive examples]

Page 46: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos
IF A THEN pos
IF A & B THEN pos

[Figure: adding condition B specializes the rule further, so that it covers only positive examples]

Page 47: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos

IF A & B THEN pos

[Figure: the rule "IF A & B THEN pos" is kept and the examples it covers are removed; learning of the next rule starts again from "IF true THEN pos"]

Page 48: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos
IF C THEN pos

IF A & B THEN pos

[Figure: on the remaining examples the new rule is specialized with condition C]

Page 49: Decision-Tree Induction & Decision-Rule Induction

Illustration

IF true THEN pos
IF C THEN pos
IF C & D THEN pos

IF A & B THEN pos

[Figure: adding condition D yields the rule "IF C & D THEN pos", which again covers only positive examples]

Page 50: Decision-Tree Induction & Decision-Rule Induction

Learning One Rule

To learn one rule we use one of the strategies below:

• Top-down:
  – Start with the maximally general rule
  – Add literals one by one

• Bottom-up:
  – Start with a maximally specific rule
  – Remove literals one by one

• Combination of top-down and bottom-up:
  – Candidate-elimination algorithm

Page 51: Decision-Tree Induction & Decision-Rule Induction

Bottom-up vs. Top-down

[Figure: positive and negative examples with candidate rule regions]

Top-down: typically more general rules

Bottom-up: typically more specific rules

Page 52: Decision-Tree Induction & Decision-Rule Induction

Learning One Rule

Bottom-up:
• Example-driven (AQ family).

Top-down:
• Generate-then-Test (CN2).

Page 53: Decision-Tree Induction & Decision-Rule Induction

Example of Learning One Rule

Page 54: Decision-Tree Induction & Decision-Rule Induction

Heuristics for Learning One Rule

– When is a rule “good”?
  • High accuracy;
  • Less important: high coverage.

– Possible evaluation functions:
  • Relative frequency: nc/n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule;
  • m-estimate of accuracy: (nc + mp)/(n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight of p;
  • Entropy.
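As a concrete reading of these formulas, here is a tiny Python sketch (my own, with made-up example numbers) of the relative frequency and the m-estimate of a rule's accuracy:

def relative_frequency(n_correct, n_covered):
    """nc / n."""
    return n_correct / n_covered

def m_estimate(n_correct, n_covered, prior, m):
    """(nc + m*p) / (n + m): shrinks the accuracy towards the class prior p."""
    return (n_correct + m * prior) / (n_covered + m)

# A rule covering 5 instances, 4 of them correctly, class prior 0.5, m = 2:
print(relative_frequency(4, 5))   # 0.8
print(m_estimate(4, 5, 0.5, 2))   # ~0.71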

Page 55: Decision-Tree Induction & Decision-Rule Induction

How to Arrange the Rules

1. The rules are ordered according to the order in which they have been learned. This order is used for instance classification.

2. The rules are ordered according to their accuracy. This order is used for instance classification.

3. The rules are not ordered, but there exists a strategy for how to apply them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; if an instance is not covered by any rule, it gets the classification of the majority class in the training data).

Page 56: Decision-Tree Induction & Decision-Rule Induction

Approaches to Avoiding Overfitting

• Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data

• Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.

Page 57: Decision-Tree Induction & Decision-Rule Induction

Post-Pruning

1. Split instances into Growing Set and Pruning Set;

2. Learn set SR of rules using Growing Set;

3. Find the best simplification BSR of SR.

4. while ( Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) ) do

4.1 SR = BSR;

4.2 Find the best simplification BSR of SR.

5. return BSR;

Page 58: Decision-Tree Induction & Decision-Rule Induction

Incremental Reduced Error Pruning

[Figure: post-pruning splits the data once into a growing set and a pruning set (D1, D2, D3), whereas incremental reduced error pruning re-splits the remaining data (D1, D21, D22, …) after each rule is learned and pruned]

Page 59: Decision-Tree Induction & Decision-Rule Induction

Incremental Reduced Error Pruning

1. Split Training Set into Growing Set and Validation Set;

2. Learn rule R using Growing Set;

3. Prune the rule R using Validation Set;

4. if performance(R, Training Set) > Threshold

4.1 Add R to Set of Learned Rules

4.2 Remove from the Training Set the instances covered by R;

4.3 go to 1;

5. else return Set of Learned Rules

Page 60: Decision-Tree Induction & Decision-Rule Induction

Summary Points

1. Decision rules are easier for human comprehension than decision trees.

2. Decision rules have simpler decision boundaries than decision trees.

3. Decision rules are learned by sequential covering of the training instances.

Page 61: Decision-Tree Induction & Decision-Rule Induction

Model Evaluation Techniques

• Evaluation on the training set: too optimistic

[Figure: the classifier is built from the training set and evaluated on the same training set]

Page 62: Decision-Tree Induction & Decision-Rule Induction

Model Evaluation Techniques

• Hold-out Method: depends on the make-up of the test set.

[Figure: the data is split into a training set used to build the classifier and a separate test set used to evaluate it]

• To improve the precision of the hold-out method, it is repeated many times.

Page 63: Decision-Tree Induction & Decision-Rule Induction

Model Evaluation Techniques

• k-fold Cross Validation

[Figure: the data is split into k folds; in each round a different fold serves as the test set and the remaining folds form the training set]

   train  train  test
   train  test   train
   test   train  train
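A minimal Python sketch of k-fold cross-validation (my own illustration; train_fn(train) is assumed to return a classifier and accuracy(model, test) to evaluate it, for example the earlier id3 and accuracy sketches):

def k_fold_cross_validation(examples, k, train_fn, accuracy):
    """Average accuracy over k rounds, each using a different fold as the test set."""
    folds = [examples[i::k] for i in range(k)]   # simple interleaved split
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(train)
        scores.append(accuracy(model, test))
    return sum(scores) / k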

Page 64: Decision-Tree Induction & Decision-Rule Induction

Intro to Weka

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}

@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
………….