Transcript of Lecture 5: Decision Tree Learning

Page 1:

CSC314 / CSC763 Introduction to Machine Learning

Department of Computer Science, COMSATS Lahore

Dr. Adeel Nawab

Page 2:

Decision Tree Learning

Lecture Outline:

• What are Decision Trees?

• What problems are appropriate for Decision Trees?

• The Basic Decision Tree Learning Algorithm: ID3

• Entropy and Information Gain

Reading:

• Chapter 3 of Mitchell

Page 3:

What are Decision Trees?

• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

• Learned trees can also be re-represented as sets of if-then rules to improve human readability.

• It is one of the most popular inductive inference algorithms.

• It has been successfully applied to a broad range of tasks.

Page 4:

What are Decision Trees?

• Decision trees are trees which classify instances by testing some attribute of the instance at each node.

• Testing starts at the root node and proceeds downwards to a leaf node, which indicates the classification of the instance.

• Each branch leading out of a node corresponds to a value of the attribute being tested at that node.

Page 5:

What are Decision Trees? (cont…)

• A decision tree to classify days as appropriate for playing tennis might look like:
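The tree itself appears on the slide as a figure; rendered as text, it is the classic PlayTennis tree from Mitchell (it matches the disjunction given on the next slide):

    Outlook
    ├─ Sunny    → Humidity
    │              ├─ High   → No
    │              └─ Normal → Yes
    ├─ Overcast → Yes
    └─ Rain     → Wind
                   ├─ Strong → No
                   └─ Weak   → Yes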

Page 6:

What are Decision Trees? (cont)

Note that:

– each path through a decision tree forms a conjunction of attribute tests;

– the tree as a whole forms a disjunction of such paths, i.e. a disjunction of conjunctions of attribute tests.

• The preceding example could be re-expressed as:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
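As a sketch of the if-then-rule re-representation, this disjunction can be coded directly; the Python function below is illustrative (its name and signature are not from the slides):

    def play_tennis(outlook, humidity, wind):
        """Classify a day using the PlayTennis tree, written as
        a disjunction of conjunctions of attribute tests."""
        if outlook == "Sunny" and humidity == "Normal":
            return "Yes"
        if outlook == "Overcast":
            return "Yes"
        if outlook == "Rain" and wind == "Weak":
            return "Yes"
        return "No"

    print(play_tennis("Sunny", "Normal", "Strong"))  # Yes
    print(play_tennis("Rain", "High", "Strong"))     # No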

Page 7:

What are Decision Trees? (cont)

• As a complex rule, such a decision tree could be coded by hand.

• However, the challenge for machine learning is to propose algorithms for learning decision trees from examples.

Page 8:

What Problems are Appropriate for Decision Trees?

• There are several varieties of decision tree learning, but in general decision tree learning is best for problems where:

• Instances are describable by attribute–value pairs

– usually nominal (categorical/enumerated/discrete) attributes with a small number of discrete values, but attributes can be numeric (ordinal/continuous)

Page 9:

What Problems are Appropriate for Decision Trees? (cont…)

• Target function is discrete valued

– in the PlayTennis example the target function is boolean

– easy to extend to target functions with > 2 output values

– harder, but possible, to extend to numeric target functions

• Disjunctive hypotheses may be required

– easy for decision trees to learn disjunctive concepts (note such concepts were outside the hypothesis space of the Candidate-Elimination algorithm)

Page 10:

What Problems are Appropriate for Decision Trees? (cont…)

• Possibly noisy/incomplete training data

– robust to errors in classification of training examples and to errors in the attribute values describing these examples

• Can be trained on examples where, for some instances, some attribute values are unknown/missing

Page 11:

Sample Applications of Decision Trees

• Decision trees have been used for (see http://www.rulequest.com/see5-examples.html):

• Predicting Magnetic Properties of Crystals

• Profiling Higher-Priced Houses in Boston

• Detecting Advertisements on the Web

• Controlling a Production Process

• Diagnosing Hypothyroidism

• Assessing Credit Risk

Page 12:

• Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.

Page 13:

Sample Applications of Decision Trees (cont)

• Assessing Credit Risk

Page 14:

Sample Applications of Decision Trees (cont)

– From 490 cases like this, split 44%/56% between accept/reject, See5 derived twelve rules.

– On a further 200 unseen cases, these rules give a classification accuracy of 83%.

Page 15:

ID3 Algorithm

• ID3 learns decision trees by constructing them top-down, beginning with the question:

– "Which attribute should be tested at the root of the tree?"

• Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.

Page 16:

ID3 Algorithm (cont…)

• The best attribute is selected and used as the test at the root node of the tree.

• A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute).

Page 17:

ID3 Algorithm (cont…)

• The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.

• This process continues for each new leaf node until either of two conditions is met:

– every attribute has already been included along this path through the tree, or

– the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).

Page 18:

ID3 Algorithm (cont…)

• This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.

Pages 19–21:

The Basic Decision Tree Learning Algorithm: ID3
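These slides show Mitchell's ID3 pseudocode as a figure. A minimal Python sketch of the procedure just described is given below; the names are illustrative, not from the slides, and it branches only over attribute values observed in the data (a simplification of Mitchell's version, which creates a branch for every possible value):

    import math
    from collections import Counter

    def entropy(examples, target):
        """Entropy of the target attribute over a list of example dicts."""
        counts = Counter(ex[target] for ex in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attr, target):
        """Expected reduction in entropy from partitioning examples on attr."""
        total = len(examples)
        remainder = 0.0
        for value in {ex[attr] for ex in examples}:
            subset = [ex for ex in examples if ex[attr] == value]
            remainder += (len(subset) / total) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, attributes, target):
        """Grow a decision tree top-down: greedy, no backtracking."""
        labels = {ex[target] for ex in examples}
        if len(labels) == 1:           # all examples in one class: entropy zero
            return labels.pop()
        if not attributes:             # attributes exhausted: majority-class leaf
            return Counter(ex[target] for ex in examples).most_common(1)[0][0]
        best = max(attributes,         # the statistical test: information gain
                   key=lambda a: information_gain(examples, a, target))
        tree = {best: {}}
        rest = [a for a in attributes if a != best]
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = id3(subset, rest, target)
        return tree

Called on Mitchell's 14 PlayTennis examples with attributes Outlook, Temperature, Humidity and Wind, this would reproduce the tree on Page 5.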

Page 22:

Which Attribute is the Best Classifier?

• In the ID3 algorithm, choosing which attribute to test at the next node is a crucial step.

• We would like to choose the attribute which does best at separating the training examples according to their target classification.

• An attribute which separates the training examples into two sets, each of which contains positive/negative examples of the target attribute in the same ratio as the initial set of examples, has not helped us progress towards a classification.

Page 23:

Which Attribute is the Best Classifier?

• Suppose we have 14 training examples, 9 +ve and 5 -ve, of days on which tennis is played.

• For each day we have information about the attributes Humidity and Wind, as below.

• Which attribute is the best classifier?
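• In Mitchell's Chapter 3 example the counts are:

    S = [9+, 5−]
    Humidity = High: [3+, 4−]    Humidity = Normal: [6+, 1−]
    Wind = Weak:     [6+, 2−]    Wind = Strong:     [3+, 3−]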

Page 24:

Entropy and Information Gain

• A useful measure for picking the best classifier attribute is information gain.

• Information gain measures how well a given attribute separates the training examples with respect to their target classification.

• Information gain is defined in terms of entropy, as used in information theory.

Page 25:

Entropy and Information Gain (cont…)
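(The slide presents Mitchell's definition as a figure: given a set S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖, where p⊕ and p⊖ are the proportions of positive and negative examples in S.)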

Page 26:

Entropy

• For our previous example (14 examples, 9 positive, 5 negative):

Entropy([9+, 5−]) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
                  = −(9/14) log2(9/14) − (5/14) log2(5/14)
                  = 0.940
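A quick check of this arithmetic in Python (a minimal sketch, not from the slides):

    import math

    def entropy(p_pos, p_neg):
        """Entropy of a boolean-labelled sample, taking 0 * log2(0) = 0."""
        if p_pos == 0 or p_neg == 0:
            return 0.0
        return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

    print(round(entropy(9/14, 5/14), 3))  # 0.94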

Page 27:

Entropy (cont…)

• Think of Entropy(S) as the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).

• For example:

– If p⊕ = 1 (all instances are positive) then no message need be sent (the receiver knows the example will be positive) and Entropy = 0 ("pure sample").

– If p⊕ = 0.5 then 1 bit must be sent to indicate whether the instance is negative or positive, and Entropy = 1.

– If p⊕ = 0.8 then less than 1 bit need be sent on average – assign shorter codes to collections of positive examples and longer ones to negative ones.
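Plugging these cases into the entropy sketch above confirms the intuition:

    print(entropy(1.0, 0.0))            # 0.0   (pure sample, nothing to send)
    print(entropy(0.5, 0.5))            # 1.0   (one full bit per example)
    print(round(entropy(0.8, 0.2), 3))  # 0.722 (less than 1 bit on average)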

Page 28:

Entropy (cont…)

• Why? Information theory: an optimal-length code assigns −log2 p bits to a message having probability p.

• So the expected number of bits needed to encode the class (⊕ or ⊖) of a random member of S is:

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

i.e.

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Page 29:

Information Gain

• Entropy gives a measure of the purity/impurity of a set of examples.

• Define information gain as the expected reduction in entropy resulting from partitioning a set of examples on the basis of an attribute.

• Formally, given a set of examples S and an attribute A:
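Gain(S, A) ≡ Entropy(S) − Σ v∈Values(A) (|Sv| / |S|) Entropy(Sv)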

Page 30:

Information Gain (cont…)

• where:

– Values(A) is the set of values attribute A can take on

– Sv is the subset of S for which A has value v

• The first term in Gain(S, A) is the entropy of the original set; the second term is the expected entropy after partitioning on A, i.e. the sum of the entropies of each subset Sv, weighted by the ratio |Sv| / |S|.
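As a sketch, the gains for the Humidity/Wind comparison on Page 23 can be computed directly (counts as in Mitchell's example; helper names are illustrative):

    import math

    def entropy(p, q):
        """Entropy of a boolean sample with class proportions p and q."""
        return 0.0 if p == 0 or q == 0 else -p * math.log2(p) - q * math.log2(q)

    def gain(pos, neg, subsets):
        """Gain(S, A) for S = [pos+, neg-] partitioned by attribute A into
        subsets given as (pos_v, neg_v) count pairs."""
        total = pos + neg
        remainder = sum((p + n) / total * entropy(p / (p + n), n / (p + n))
                        for p, n in subsets)
        return entropy(pos / total, neg / total) - remainder

    # Humidity: High = [3+, 4-], Normal = [6+, 1-]
    print(round(gain(9, 5, [(3, 4), (6, 1)]), 3))  # 0.152 (0.151 in Mitchell)
    # Wind: Weak = [6+, 2-], Strong = [3+, 3-]
    print(round(gain(9, 5, [(6, 2), (3, 3)]), 3))  # 0.048

Humidity has the higher gain, so it is the better classifier of the two.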

Page 31:

Information Gain (cont…)

Page 32:

Information Gain (cont…)

Page 33:

Extended Example

Page 34:

Extended Example (cont)

Page 35:

Extended Example (cont)

• The final tree for S is:

Page 36:

Hypothesis Space Search by ID3

• ID3 searches a space of hypotheses (the set of possible decision trees) for one fitting the training data.

• The search is a simple-to-complex, hill-climbing search guided by the information gain evaluation function.

Page 37:

Hypothesis Space Search by ID3

Page 38:

Hypothesis Space Search by ID3 (cont…)

• The hypothesis space of ID3 is the complete space of finite, discrete-valued functions w.r.t. the available attributes

– contrast with incomplete hypothesis spaces, such as a conjunctive hypothesis space

Page 39:

Hypothesis Space Search by ID3 (cont…)

• ID3 maintains only one hypothesis at any time, instead of, e.g., all hypotheses consistent with the training examples seen so far

– contrast with CANDIDATE-ELIMINATION

– means we can't determine how many alternative decision trees are consistent with the data

– means we can't ask questions to resolve competing alternatives

Page 40:

Hypothesis Space Search by ID3 (cont…)

• ID3 performs no backtracking – once an attribute is selected for testing at a given node, this choice is never reconsidered.

– so, susceptible to converging to locally optimal rather than globally optimal solutions

Page 41:

Hypothesis Space Search by ID3 (cont…)

• ID3 uses all training examples at each step to make a statistically-based decision about how to refine the current hypothesis

– contrast with CANDIDATE-ELIMINATION or FIND-S, which make decisions incrementally based on single training examples

– using statistically-based properties of all examples (information gain) means the technique is robust in the face of errors in individual examples.

Page 42:

Summary

• Decision trees classify instances. Testing starts at the root and proceeds downwards:

– Non-leaf nodes test one attribute of the instance, and the attribute value determines which branch is followed.

– Leaf nodes are instance classifications.

Page 43:

Summary (cont…)

• Decision trees are appropriate for problems where:

– instances are describable by attribute–value pairs (typically, but not necessarily, nominal);

– the target function is discrete valued (typically, but not necessarily);

– disjunctive hypotheses may be required;

– training data may be noisy/incomplete.

Page 44:

Summary (cont…)

• Various algorithms have been proposed to learn decision trees – ID3 is the classic. ID3:

– recursively grows the tree from the root, picking at each point the attribute which maximises information gain with respect to the training examples sorted to the current node;

– recursion stops when all examples down a branch fall into a single class, or all attributes have been tested.

• ID3 carries out an incomplete search of a complete hypothesis space – contrast with CANDIDATE-ELIMINATION, which carries out a complete search of an incomplete hypothesis space.