CSC314 / CSC763 Introduction to Machine Learning
Department of Computer Science, COMSATS Lahore
Dr. Adeel Nawab
Decision Tree Learning
Lecture Outline:
• What are Decision Trees?
• What problems are appropriate for Decision Trees?
• The Basic Decision Tree Learning Algorithm: ID3
• Entropy and Information Gain

Reading:
• Chapter 3 of Mitchell
What are Decision Trees?
• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
• Learned trees can also be re-represented as sets of if-then rules to improve human readability.
• Decision tree learning is one of the most popular inductive inference algorithms.
• It has been successfully applied to a broad range of tasks.
What are Decision Trees?
• Decision trees classify instances by testing some attribute of the instance at each node.
• Testing starts at the root node and proceeds downwards to a leaf node, which indicates the classification of the instance.
• Each branch leading out of a node corresponds to a value of the attribute being tested at that node.
What are Decision Trees? (cont)
• A decision tree to classify days as appropriate for playing tennis might look like the following:
What are Decision Trees? (cont)
Note that:
– each path through a decision tree forms a conjunction of attribute tests;
– the tree as a whole forms a disjunction of such paths, i.e. a disjunction of conjunctions of attribute tests.
• The preceding example could be re-expressed as:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
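The classification procedure described above can be sketched in Python. This is a minimal illustration (the nested-tuple encoding and the names `TREE` and `classify` are my own, not from the lecture) of the PlayTennis tree implied by the disjunction above:

```python
# Hypothetical encoding of the PlayTennis tree: each internal node is
# (attribute, {value: subtree}); each leaf is a classification label.
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No",  "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind",     {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, instance):
    """Walk from the root downwards, testing one attribute per node."""
    while isinstance(tree, tuple):            # internal node: keep testing
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the branch for this value
    return tree                               # leaf node = classification

print(classify(TREE, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```

Each root-to-leaf path taken by `classify` is one conjunction of attribute tests; the tree as a whole realises the disjunction of the "Yes" paths.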
What are Decision Trees? (cont)
• As a complex rule, such a decision tree could be coded by hand.
• However, the challenge for machine learning is to propose algorithms for learning decision trees from examples.
What Problems are Appropriate for Decision Trees?
• There are several varieties of decision tree learning, but in general decision tree learning is best for problems where:
• Instances are describable by attribute–value pairs
– usually nominal (categorical/enumerated/discrete) attributes with a small number of discrete values, but attributes can also be numeric (ordinal/continuous).
What Problems are Appropriate for Decision Trees? (cont…)
• Target function is discrete valued
– in the PlayTennis example the target function is boolean
– easy to extend to target functions with more than two output values
– harder, but possible, to extend to numeric target functions
• Disjunctive hypotheses may be required
– easy for decision trees to learn disjunctive concepts (note such concepts were outside the hypothesis space of the Candidate-Elimination algorithm)
What Problems are Appropriate for Decision Trees? (cont…)
• Possibly noisy/incomplete training data
– robust to errors in the classification of training examples and to errors in the attribute values describing these examples
– can be trained on examples where, for some instances, some attribute values are unknown/missing.
Sample Applications of Decision Trees?
• Decision trees have been used for (see http://www.rulequest.com/see5-examples.html):
• Predicting Magnetic Properties of Crystals
• Profiling Higher-Priced Houses in Boston
• Detecting Advertisements on the Web
• Controlling a Production Process
• Diagnosing Hypothyroidism
• Assessing Credit Risk
• Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.
Sample Applications of Decision Trees? (cont)
• Assessing Credit Risk
– From 490 such cases, split 44%/56% between accept/reject, See5 derived twelve rules.
– On a further 200 unseen cases, these rules gave a classification accuracy of 83%.
ID3 Algorithm
• ID3 learns decision trees by constructing them top-down, beginning with the question:
– "which attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
ID3 Algorithm Cont….
• The best attribute is selected and used as the test at the root node of the tree.
• A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute).
ID3 Algorithm Cont….
• The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
• This process continues for each new leaf node until either of two conditions is met:
– every attribute has already been included along this path through the tree, or
– the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).
ID3 Algorithm Cont….
• This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
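The recursive procedure described on these slides can be sketched in Python. This is a minimal illustration, not Quinlan's original implementation; all function names are my own, and `entropy`/`info_gain` implement the measures defined later in the lecture:

```python
from collections import Counter
import math

def entropy(examples, target):
    """Entropy (in bits) of the target-class distribution in `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from partitioning `examples` on `attr`."""
    total = len(examples)
    remainder = sum(
        len(subset) / total * entropy(subset, target)
        for subset in (
            [ex for ex in examples if ex[attr] == v]
            for v in {ex[attr] for ex in examples}
        )
    )
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Grow the tree top-down: nodes are (attribute, {value: subtree}), leaves are labels."""
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:                      # all examples agree: entropy is zero
        return classes.pop()
    if not attributes:                         # attributes exhausted along this path
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    branches = {}
    for v in {ex[best] for ex in examples}:    # one descendant per value of `best`
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best], target)
    return (best, branches)
```

Note the greedy structure: each call commits to the single best attribute at that node and never revisits the choice.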
The Basic Decision Tree Learning Algorithm: ID3
Which Attribute is the Best Classifier?
• In the ID3 algorithm, choosing which attribute to test at the next node is a crucial step.
• We would like to choose the attribute which does best at separating training examples according to their target classification.
• An attribute which separates training examples into two sets, each of which contains positive/negative examples of the target attribute in the same ratio as the initial set of examples, has not helped us progress towards a classification.
Which Attribute is the Best Classifier?
• Suppose we have 14 training examples, 9 +ve and 5 -ve, of days on which tennis is played.
• For each day we have information about the attributes humidity and wind.
• Which attribute is the best classifier?
Entropy and Information Gain
• A useful measure for picking the best classifier attribute is information gain.
• Information gain measures how well a given attribute separates training examples with respect to their target classification.
• Information gain is defined in terms of entropy as used in information theory.
Entropy and Information Gain (cont…)
Entropy
• Given a set S of examples, with p⊕ the proportion of positive and p⊖ the proportion of negative examples:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

• For our previous example (14 examples, 9 positive, 5 negative):

Entropy([9+,5−]) = −(9/14) log2 (9/14) − (5/14) log2 (5/14)
= 0.940
Entropy Cont….
• Think of Entropy(S) as the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).
• For example:
– If p⊕ = 1 (all instances are positive) then no message need be sent (the receiver knows the example will be positive) and Entropy = 0 ("pure sample").
– If p⊕ = .5 then 1 bit need be sent to indicate whether the instance is negative or positive, and Entropy = 1.
– If p⊕ = .8 then less than 1 bit need be sent on average – assign shorter codes to collections of positive examples and longer ones to negative examples.
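The entropy values quoted on these slides can be checked numerically. A short sketch (the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a boolean class where P(positive) = p."""
    if p in (0.0, 1.0):
        return 0.0                       # pure sample: no bits needed
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(round(binary_entropy(9 / 14), 3))  # 0.94  (the [9+,5-] example)
print(round(binary_entropy(1.0), 3))     # 0.0   (pure sample)
print(round(binary_entropy(0.5), 3))     # 1.0   (one full bit)
print(round(binary_entropy(0.8), 3))     # 0.722 (less than one bit on average)
```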
Entropy Cont….
• Why? Information theory: an optimal length code assigns −log2 p bits to a message having probability p.
• So the expected number of bits needed to encode the class (⊕ or ⊖) of a random member of S is:

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
Information Gain
• Entropy gives a measure of the purity/impurity of a set of examples.
• Define information gain as the expected reduction in entropy resulting from partitioning a set of examples on the basis of an attribute.
• Formally, given a set of examples S and attribute A:

Gain(S, A) ≡ Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
Information Gain Cont….
• where
– Values(A) is the set of values attribute A can take on
– Sv is the subset of S for which A has value v
• The first term in Gain(S, A) is the entropy of the original set; the second term is the expected entropy after partitioning on A, i.e. the sum of the entropies of each subset Sv, weighted by the ratio |Sv|/|S|.
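The 14-example question posed earlier ("which attribute is the best classifier, humidity or wind?") can be worked through with this definition. The attribute splits below are an assumption taken from the standard PlayTennis example in Mitchell, Chapter 3, since the slide's data table did not survive transcription:

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

def gain(parent, subsets):
    """Gain(S, A): parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(p + n for p, n in subsets)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - remainder

# Parent set: (9+, 5-).  Assumed splits (from Mitchell's textbook example):
#   Humidity: High -> (3+, 4-), Normal -> (6+, 1-)
#   Wind:     Weak -> (6+, 2-), Strong -> (3+, 3-)
print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))  # Humidity: 0.152
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))  # Wind:     0.048
```

Under these assumed splits, humidity has the larger gain and would be chosen as the better classifier.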
Extended Example
• The final tree for S is:
Hypothesis Space Search by ID3
• ID3 searches a space of hypotheses (the set of possible decision trees) for one fitting the training data.
• The search is a simple-to-complex, hill-climbing search guided by the information gain evaluation function.
Hypothesis Space Search by ID3 (cont…)
• The hypothesis space of ID3 is the complete space of finite, discrete-valued functions w.r.t. the available attributes
– contrast with incomplete hypothesis spaces, such as a conjunctive hypothesis space.
Hypothesis Space Search by ID3 (cont…)
• ID3 maintains only one hypothesis at any time, instead of, e.g., all hypotheses consistent with the training examples seen so far
– contrast with CANDIDATE-ELIMINATION
– means it can't determine how many alternative decision trees are consistent with the data
– means it can't ask questions to resolve competing alternatives.
Hypothesis Space Search by ID3 (cont…)
• ID3 performs no backtracking – once an attribute is selected for testing at a given node, this choice is never reconsidered
– so it is susceptible to converging to locally optimal rather than globally optimal solutions.
Hypothesis Space Search by ID3 (cont…)
• ID3 uses all training examples at each step to make a statistically-based decision about how to refine the current hypothesis
– contrast with CANDIDATE-ELIMINATION or FIND-S, which make decisions incrementally based on single training examples
– using statistically-based properties of all examples (information gain) means the technique is robust in the face of errors in individual examples.
Summary
• Decision trees classify instances. Testing starts at the root and proceeds downwards:
– Non-leaf nodes test one attribute of the instance, and the attribute value determines which branch is followed.
– Leaf nodes are instance classifications.
Summary (cont…)
• Decision trees are appropriate for problems where:
– instances are describable by attribute–value pairs (typically, but not necessarily, nominal);
– the target function is discrete valued (typically, but not necessarily);
– disjunctive hypotheses may be required;
– training data may be noisy/incomplete.
Summary (cont…)
• Various algorithms have been proposed to learn decision trees – ID3 is the classic. ID3:
– recursively grows the tree from the root, picking at each point the attribute which maximises information gain with respect to the training examples sorted to the current node;
– recursion stops when all examples down a branch fall into a single class, or all attributes have been tested.
• ID3 carries out an incomplete search of a complete hypothesis space – contrast with CANDIDATE-ELIMINATION, which carries out a complete search of an incomplete hypothesis space.