CSC314 / CSC763 Introduction to Machine Learning
Department of Computer Science, COMSATS Lahore
Dr. Adeel Nawab
Decision Tree Learning
Lecture Outline:
• What are Decision Trees?
• What problems are appropriate for Decision Trees?
• The Basic Decision Tree Learning Algorithm: ID3
• Entropy and Information Gain

Reading:
• Chapter 3 of Mitchell
What are Decision Trees?
• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
• Learned trees can also be re-represented as sets of if-then rules to improve human readability.
• Decision tree learning is one of the most popular inductive inference algorithms.
• It has been successfully applied to a broad range of tasks.
What are Decision Trees?
• Decision trees classify instances by testing some attribute of the instance at each node.
• Testing starts at the root node and proceeds downwards to a leaf node, which indicates the classification of the instance.
• Each branch leading out of a node corresponds to a value of the attribute being tested at that node.
What are Decision Trees? (cont)
• A decision tree to classify days as appropriate for playing tennis might look like the following:
What are Decision Trees? (cont)
Note that:
– each path through a decision tree forms a conjunction of attribute tests;
– the tree as a whole forms a disjunction of such paths, i.e. a disjunction of conjunctions of attribute tests.
• The preceding example could be re-expressed as:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
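The classification procedure described above can be sketched in Python. This is a minimal illustration (the nested-tuple encoding and the names `TREE` and `classify` are my own, not from the lecture) of the PlayTennis tree implied by the disjunction above:

```python
# Hypothetical encoding of the PlayTennis tree: each internal node is
# (attribute, {value: subtree}); each leaf is a classification label.
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No",  "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind",     {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, instance):
    """Walk from the root downwards, testing one attribute per node."""
    while isinstance(tree, tuple):            # internal node: keep testing
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the branch for this value
    return tree                               # leaf node = classification

print(classify(TREE, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```

Each root-to-leaf path taken by `classify` is one conjunction of attribute tests; the tree as a whole realises the disjunction of the "Yes" paths.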
What are Decision Trees? (cont)
• As a complex rule, such a decision tree could be coded by hand.
• However, the challenge for machine learning is to propose algorithms for learning decision trees from examples.
What Problems are Appropriate for Decision Trees?
• There are several varieties of decision tree learning, but in general decision tree learning is best for problems where:
• Instances are describable by attribute–value pairs
– usually nominal (categorical/enumerated/discrete) attributes with a small number of discrete values, but attributes can also be numeric (ordinal/continuous).
What Problems are Appropriate for Decision Trees? (cont…)
• Target function is discrete valued
– in the PlayTennis example the target function is boolean
– easy to extend to target functions with more than two output values
– harder, but possible, to extend to numeric target functions
• Disjunctive hypotheses may be required
– easy for decision trees to learn disjunctive concepts (note such concepts were outside the hypothesis space of the Candidate-Elimination algorithm)
What Problems are Appropriate for Decision Trees? (cont…)
• Possibly noisy/incomplete training data
– robust to errors in the classification of training examples and to errors in the attribute values describing these examples
– can be trained on examples where, for some instances, some attribute values are unknown/missing.
Sample Applications of Decision Trees?
• Decision trees have been used for (see http://www.rulequest.com/see5-examples.html):
• Predicting Magnetic Properties of Crystals
• Profiling Higher-Priced Houses in Boston
• Detecting Advertisements on the Web
• Controlling a Production Process
• Diagnosing Hypothyroidism
• Assessing Credit Risk
• Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.
Sample Applications of Decision Trees? (cont)
• Assessing Credit Risk
– From 490 such cases, split 44%/56% between accept/reject, See5 derived twelve rules.
– On a further 200 unseen cases, these rules gave a classification accuracy of 83%.
ID3 Algorithm
• ID3 learns decision trees by constructing them top-down, beginning with the question:
– "which attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
ID3 Algorithm Cont….
• The best attribute is selected and used as the test at the root node of the tree.
• A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute).
ID3 Algorithm Cont….
• The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
• This process continues for each new leaf node until either of two conditions is met:
– every attribute has already been included along this path through the tree, or
– the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).
ID3 Algorithm Cont….
• This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
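The recursive procedure described on these slides can be sketched in Python. This is a minimal illustration, not Quinlan's original implementation; all function names are my own, and `entropy`/`info_gain` implement the measures defined later in the lecture:

```python
from collections import Counter
import math

def entropy(examples, target):
    """Entropy (in bits) of the target-class distribution in `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from partitioning `examples` on `attr`."""
    total = len(examples)
    remainder = sum(
        len(subset) / total * entropy(subset, target)
        for subset in (
            [ex for ex in examples if ex[attr] == v]
            for v in {ex[attr] for ex in examples}
        )
    )
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Grow the tree top-down: nodes are (attribute, {value: subtree}), leaves are labels."""
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:                      # all examples agree: entropy is zero
        return classes.pop()
    if not attributes:                         # attributes exhausted along this path
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    branches = {}
    for v in {ex[best] for ex in examples}:    # one descendant per value of `best`
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best], target)
    return (best, branches)
```

Note the greedy structure: each call commits to the single best attribute at that node and never revisits the choice.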
The Basic Decision Tree Learning Algorithm: ID3
Which Attribute is the Best Classifier?
• In the ID3 algorithm, choosing which attribute to test at the next node is a crucial step.
• We would like to choose the attribute which does best at separating training examples according to their target classification.
• An attribute which separates training examples into two sets, each of which contains positive/negative examples of the target attribute in the same ratio as the initial set of examples, has not helped us progress towards a classification.
Which Attribute is the Best Classifier?
• Suppose we have 14 training examples, 9 +ve and 5 -ve, of days on which tennis is played.
• For each day we have information about the attributes humidity and wind.
• Which attribute is the best classifier?
Entropy and Information Gain
• A useful measure for picking the best classifier attribute is information gain.
• Information gain measures how well a given attribute separates training examples with respect to their target classification.
• Information gain is defined in terms of entropy as used in information theory.
Entropy and Information Gain (cont…)
Entropy
• Given a set S of examples, with p⊕ the proportion of positive and p⊖ the proportion of negative examples:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

• For our previous example (14 examples, 9 positive, 5 negative):

Entropy([9+,5−]) = −(9/14) log2 (9/14) − (5/14) log2 (5/14)
= 0.940
Entropy Cont….
• Think of Entropy(S) as the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).
• For example:
– If p⊕ = 1 (all instances are positive) then no message need be sent (the receiver knows the example will be positive) and Entropy = 0 ("pure sample").
– If p⊕ = .5 then 1 bit need be sent to indicate whether the instance is negative or positive, and Entropy = 1.
– If p⊕ = .8 then less than 1 bit need be sent on average – assign shorter codes to collections of positive examples and longer ones to negative examples.
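The entropy values quoted on these slides can be checked numerically. A short sketch (the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a boolean class where P(positive) = p."""
    if p in (0.0, 1.0):
        return 0.0                       # pure sample: no bits needed
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(round(binary_entropy(9 / 14), 3))  # 0.94  (the [9+,5-] example)
print(round(binary_entropy(1.0), 3))     # 0.0   (pure sample)
print(round(binary_entropy(0.5), 3))     # 1.0   (one full bit)
print(round(binary_entropy(0.8), 3))     # 0.722 (less than one bit on average)
```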
Entropy Cont….
• Why? Information theory: an optimal length code assigns −log2 p bits to a message having probability p.
• So the expected number of bits needed to encode the class (⊕ or ⊖) of a random member of S is:

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
Information Gain
• Entropy gives a measure of the purity/impurity of a set of examples.
• Define information gain as the expected reduction in entropy resulting from partitioning a set of examples on the basis of an attribute.
• Formally, given a set of examples S and attribute A:

Gain(S, A) ≡ Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
Information Gain Cont….
• where
– Values(A) is the set of values attribute A can take on
– Sv is the subset of S for which A has value v
• The first term in Gain(S, A) is the entropy of the original set; the second term is the expected entropy after partitioning on A, i.e. the sum of the entropies of each subset Sv, weighted by the ratio |Sv|/|S|.
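The 14-example question posed earlier ("which attribute is the best classifier, humidity or wind?") can be worked through with this definition. The attribute splits below are an assumption taken from the standard PlayTennis example in Mitchell, Chapter 3, since the slide's data table did not survive transcription:

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

def gain(parent, subsets):
    """Gain(S, A): parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(p + n for p, n in subsets)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - remainder

# Parent set: (9+, 5-).  Assumed splits (from Mitchell's textbook example):
#   Humidity: High -> (3+, 4-), Normal -> (6+, 1-)
#   Wind:     Weak -> (6+, 2-), Strong -> (3+, 3-)
print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))  # Humidity: 0.152
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))  # Wind:     0.048
```

Under these assumed splits, humidity has the larger gain and would be chosen as the better classifier.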
Extended Example
• The final tree for S is:
Hypothesis Space Search by ID3
• ID3 searches a space of hypotheses (the set of possible decision trees) for one fitting the training data.
• The search is a simple-to-complex, hill-climbing search guided by the information gain evaluation function.
Hypothesis Space Search by ID3 (cont…)
• The hypothesis space of ID3 is the complete space of finite, discrete-valued functions w.r.t. the available attributes
– contrast with incomplete hypothesis spaces, such as a conjunctive hypothesis space.
Hypothesis Space Search by ID3 (cont…)
• ID3 maintains only one hypothesis at any time, instead of, e.g., all hypotheses consistent with the training examples seen so far
– contrast with CANDIDATE-ELIMINATION
– means it can't determine how many alternative decision trees are consistent with the data
– means it can't ask questions to resolve competing alternatives.
Hypothesis Space Search by ID3 (cont…)
• ID3 performs no backtracking – once an attribute is selected for testing at a given node, this choice is never reconsidered
– so it is susceptible to converging to locally optimal rather than globally optimal solutions.
Hypothesis Space Search by ID3 (cont…)
• ID3 uses all training examples at each step to make a statistically-based decision about how to refine the current hypothesis
– contrast with CANDIDATE-ELIMINATION or FIND-S, which make decisions incrementally based on single training examples
– using statistically-based properties of all examples (information gain) means the technique is robust in the face of errors in individual examples.
Summary
• Decision trees classify instances. Testing starts at the root and proceeds downwards:
– Non-leaf nodes test one attribute of the instance, and the attribute value determines which branch is followed.
– Leaf nodes are instance classifications.
Summary (cont…)
• Decision trees are appropriate for problems where:
– instances are describable by attribute–value pairs (typically, but not necessarily, nominal);
– the target function is discrete valued (typically, but not necessarily);
– disjunctive hypotheses may be required;
– training data may be noisy/incomplete.
Summary (cont…)
• Various algorithms have been proposed to learn decision trees – ID3 is the classic. ID3:
– recursively grows the tree from the root, picking at each point the attribute which maximises information gain with respect to the training examples sorted to the current node;
– recursion stops when all examples down a branch fall into a single class, or all attributes have been tested.
• ID3 carries out an incomplete search of a complete hypothesis space – contrast with CANDIDATE-ELIMINATION, which carries out a complete search of an incomplete hypothesis space.