Lecture5 - C4.5


Page 1: Lecture5 - C4.5

Introduction to Machine Learning

Lecture 5

Albert Orriols i Puig ([email protected])

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture5 - C4.5

Recap of Lecture 4

The input consists of examples described by different characteristics (features)


Page 3: Lecture5 - C4.5

Recap of Lecture 4: Different problems in machine learning

Classification: Find the class to which a new instance belongs. E.g.: Find whether a new patient has cancer or not

Numeric prediction: A variation of classification in which the output consists of numeric classes

E.g.: Find the frequency of cancerous cells found

Regression: Find a function that fits your examples. E.g.: Find a function that controls your chain process

Association: Find associations among your problem attributes or variables

E.g.: Find relations such as: a patient with high blood pressure is more likely to have heart-attack disease

Clustering: Process to cluster/group the instances into classes. E.g.: Group clients whose purchases are similar

Page 4: Lecture5 - C4.5

Today’s Agenda

Reviewing the Goal of Data Classification
What's a Decision Tree
How to build Decision Trees:
ID3
From ID3 to C4.5

Run C4.5 on Weka


Page 5: Lecture5 - C4.5

The Goal of Data Classification

Data Set → Classification Model. How?

The classification model can be implemented in several ways:
• Rules
• Decision trees
• Mathematical formulae


Page 6: Lecture5 - C4.5

The Goal of Data Classification

Data can have complex structures

We will accept the following type of data: data described by features which have a single measurement

Features can be:
Nominal: @attribute color {green, orange, red}
Continuous: @attribute length real [0,10]

I can have unknown values: I could have lost, or never have measured, the attribute of a particular example
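As a small illustration (not part of the slides), such data could be held in Python as one dictionary per example, with a single value per feature and None standing in for an unknown measurement:

    # Hypothetical examples; attribute names follow the @attribute declarations above.
    examples = [
        {"color": "green",  "length": 4.2,  "class": "iris-setosa"},
        {"color": "orange", "length": None, "class": "iris-versicolor"},  # length never measured
    ]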


Page 7: Lecture5 - C4.5

Classifying Plants: Let's classify different plants into three classes:

Iris-setosa, iris-virginica, and iris-versicolor

The dataset is shown in Weka format and is publicly available at the UCI repository.

Page 8: Lecture5 - C4.5

Classifying Plants: Decision tree from this output

Internal node: test of the value of a given attribute

Branch: value of an attribute

Leaf: predicted class

How could I automatically generate these types of trees?
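As an illustration of this structure (a sketch in Python, not the slide's actual Weka tree), a decision tree can be represented as nested nodes: each internal node tests an attribute, each branch corresponds to one of its values, and each leaf stores a predicted class.

    class Leaf:
        """Leaf node: stores the predicted class."""
        def __init__(self, prediction):
            self.prediction = prediction

    class Node:
        """Internal node: tests one attribute; branches maps attribute values to subtrees."""
        def __init__(self, attribute, branches):
            self.attribute = attribute
            self.branches = branches

    def classify(tree, example):
        """Follow the branch matching the example's value for each tested attribute."""
        while isinstance(tree, Node):
            tree = tree.branches[example[tree.attribute]]
        return tree.prediction

    # Hypothetical toy tree; the attribute and its values are illustrative only.
    toy = Node("color", {"green": Leaf("iris-setosa"),
                         "orange": Leaf("iris-versicolor"),
                         "red": Leaf("iris-virginica")})
    print(classify(toy, {"color": "red"}))   # -> iris-virginica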


Page 9: Lecture5 - C4.5

Types of DT Builders: So, where is the trick?

Choose the attribute in each internal node and choose the best partition

Several algorithms are able to generate decision trees:
Naïve Bayes Tree

Random forests

CART (classification and regression tree)

ID3

C4.5

We are going to start from ID3 and finish with C4.5. C4.5 is the most influential decision-tree builder algorithm.

Page 10: Lecture5 - C4.5

ID3: Ross Quinlan started with:

ID3 (Quinlan, 1979)

C4.5 (Quinlan, 1993)

Some assumptions in the basic algorithm:
All attributes are nominal

We do not have unknown values

With these assumptions, Quinlan designed a heuristic approach to infer decision trees from labeled data.


Page 11: Lecture5 - C4.5

Description of ID3

ID3(D, Target, Atts)  (Mitchell, 1997)
returns: a decision tree that correctly classifies the given examples

variables
  D: Training set of examples
  Target: Attribute whose value is to be predicted by the tree
  Atts: List of other attributes that may be tested by the learned decision tree

create a Root node for the tree
if D are all positive then Root ← +
else if D are all negative then Root ← –
else if Atts = ∅ then Root ← most common value of Target in D
else
  A ← the best decision attribute from Atts
  Root ← A
  for each possible value vi of A
    add a new tree branch with A = vi
    Dvi ← subset of D that have value vi for A
    if Dvi = ∅ then add leaf ← most common value of Target in D
    else add the subtree ID3(Dvi, Target, Atts − {A})
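A minimal Python sketch of this pseudocode (an illustration under the same assumptions as the slides: nominal attributes and no unknown values; examples are dictionaries, domains maps each attribute to its possible values, and choose_best_attribute stands in for the information-gain selection introduced on the following slides):

    from collections import Counter

    def most_common_value(examples, target):
        """Majority class among the examples for the given target attribute."""
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]

    def id3(examples, target, attributes, domains, choose_best_attribute):
        """Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
        classes = {ex[target] for ex in examples}
        if len(classes) == 1:                        # all examples of one class -> leaf
            return classes.pop()
        if not attributes:                           # no attributes left -> majority class
            return most_common_value(examples, target)
        best = choose_best_attribute(examples, target, attributes)
        tree = {best: {}}
        for value in domains[best]:                  # one branch per possible value of 'best'
            subset = [ex for ex in examples if ex[best] == value]
            if not subset:                           # empty branch -> majority class of parent
                tree[best][value] = most_common_value(examples, target)
            else:
                remaining = [a for a in attributes if a != best]
                tree[best][value] = id3(subset, target, remaining, domains, choose_best_attribute)
        return tree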

Page 12: Lecture5 - C4.5

Description of ID3 (continued)

This slide repeats the ID3(D, Target, Atts) pseudocode of slide 11, annotating the inputs in set notation: the training examples D = {d1, d2, ..., dL} and the candidate attributes Atts = {a1, a2, ..., ak}.

Page 13: Lecture5 - C4.5

Description of ID3 (continued)

Same pseudocode and annotation as the previous slide: D = {d1, d2, ..., dL}, Atts = {a1, a2, ..., ak}.

Page 14: Lecture5 - C4.5

Description of ID3 (continued)

The annotation now marks the attribute ai, chosen from Atts = {a1, a2, ..., ak}, as the best decision attribute for the current node.

Page 15: Lecture5 - C4.5

Description of ID3 (continued)

The annotation shows the node testing ai, with one branch per possible value: ai = v1, ai = v2, ..., ai = vn.

Page 16: Lecture5 - C4.5

Description of ID3 (continued)

The annotation shows the recursive calls: each branch ai = vj receives the subset of examples with that value and the remaining attributes, e.g. D' = {d'1, d'2, ..., d'L'} with Atts' = {a1, ..., ai-1, ai+1, ..., ak}, and D'' = {d''1, d''2, ..., d''L''} with Atts'' = {a1, ..., ai-1, ai+1, ..., ak} for another branch.

Page 17: Lecture5 - C4.5

Which Attribute Should I Select First?

Which is the best choice? We have 29 positive examples and 35 negative ones.

Should I use attribute 1 or attribute 2 in this iteration of the node?


Page 18: Lecture5 - C4.5

Let’s Rely on Information Theory

Use the concept of Entropy: it characterizes the impurity of an arbitrary collection of examples. Given S:

Entropy(S) = -p+ · log2(p+) - p- · log2(p-)

where p+ and p- are the proportion of positive/negative examples in S.

Extension to c classes:

Entropy(S) = - Σ_{i=1..c} pi · log2(pi)

Examples:
p+ = 0 → entropy = 0
p+ = 1 → entropy = 0
p+ = 0.5, p- = 0.5 → entropy = 1 (maximum)
9 positive and 5 negative examples → entropy = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940
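A short Python sketch of this entropy computation (illustrative, not from the slides); it reproduces the 9+/5- example above:

    from math import log2

    def entropy(class_counts):
        """Entropy of a collection, given the number of examples in each class."""
        total = sum(class_counts)
        proportions = [c / total for c in class_counts if c > 0]   # 0*log2(0) taken as 0
        return -sum(p * log2(p) for p in proportions)

    print(entropy([9, 5]))    # ~0.940
    print(entropy([7, 7]))    # 1.0 (maximum for two classes)
    print(entropy([14, 0]))   # 0.0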

Page 19: Lecture5 - C4.5

Entropy: What does this measure mean?

Entropy is the minimum number of bits needed to encode the classification of a randomly drawn member of S.

p+ = 1: the receiver knows the class, no message needs to be sent, Entropy = 0.
p+ = 0.5: 1 bit needed.

An optimal-length code assigns -log2(p) bits to a message having probability p (e.g., p = 0.5 → 1 bit; p = 0.25 → 2 bits).

The idea is to assign shorter codes to the more probable messages and longer codes to the less likely ones.

Thus, the expected number of bits to encode + or – for a random member of S is:

Entropy(S) = -p+ · log2(p+) - p- · log2(p-)


Page 20: Lecture5 - C4.5

Information Gain

Measures the expected reduction in entropy caused by partitioning the examples according to the given attribute

Gain(S, A): the number of bits saved when encoding the target value of an arbitrary member of S, knowing the value of attribute A.

Expected reduction in entropy caused by knowing the value of A:

Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} (|Sv| / |S|) · Entropy(Sv)

where values(A) is the set of all possible values for A, and Sv is the subset of S for which attribute A has value v
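A Python sketch of Gain(S, A), reusing the entropy function from the earlier entropy sketch (examples are assumed to be dictionaries with the class label stored under the key given by target; the names are illustrative, not a fixed API):

    from collections import Counter

    def entropy_of(examples, target):
        """Entropy of a set of examples with respect to the target attribute."""
        counts = Counter(ex[target] for ex in examples)
        return entropy(list(counts.values()))        # entropy() as defined in the earlier sketch

    def information_gain(examples, target, attribute):
        """Expected reduction in entropy obtained by partitioning on 'attribute'."""
        total = len(examples)
        remainder = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [ex for ex in examples if ex[attribute] == value]
            remainder += len(subset) / total * entropy_of(subset, target)
        return entropy_of(examples, target) - remainder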


Page 21: Lecture5 - C4.5

Remember the Example? Which is the best choice?

We have 29 positive examples and 35 negative ones

Should I use attribute 1 or attribute 2 in this iteration of the node?

Gain(A1) = 0.993 - 26/64 · 0.70 - 36/64 · 0.74 = 0.292
Gain(A2) = 0.993 - 51/64 · 0.93 - 13/64 · 0.61 = 0.128
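The arithmetic can be checked directly: 0.993 is the entropy of the full set of 29 positive and 35 negative examples, and the per-branch entropies are the values given on the slide (a quick check using the entropy sketch from before):

    parent = entropy([29, 35])                       # ~0.993
    gain_a1 = 0.993 - 26/64 * 0.70 - 36/64 * 0.74    # branch entropies as given on the slide
    gain_a2 = 0.993 - 51/64 * 0.93 - 13/64 * 0.61
    print(round(gain_a1, 3), round(gain_a2, 3))      # 0.292  0.128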


Page 22: Lecture5 - C4.5

Yet Another Example: The textbook example


Page 23: Lecture5 - C4.5

Yet Another Example


Page 24: Lecture5 - C4.5

Hypothesis Search Space


Page 25: Lecture5 - C4.5

To Sum Up: ID3 is a strong system that:

Uses hill-climbing search based on the information gain measure to search through the space of decision trees

Outputs a single hypothesis.

Never backtracks. It converges to locally optimal solutions.

Uses all training examples at each step, contrary to methods that make decisions incrementally.

Uses statistical properties of all examples: the search is less sensitive to errors in individual training examples.

Can handle noisy data by modifying its termination criterion to accept hypotheses that imperfectly fit the data.


Page 26: Lecture5 - C4.5

Next Class

From ID3 to C4.5. C4.5 extends ID3 and enables the system to:

Be more robust in the presence of noise, avoiding overfitting

Deal with continuous attributes

Deal with missing data

Convert trees to rules


Page 27: Lecture5 - C4.5

Introduction to Machine Learning

Lecture 5

Albert Orriols i Puig ([email protected])

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull