Data Mining-Knowledge Presentation—ID3 algorithm
Transcript of Data Mining-Knowledge Presentation—ID3 algorithm
Prof. Sin-Min Lee, Department of Computer Science
Data Mining Tasks

Predicting onto new data by using rules, patterns, and behaviors
– Classification
– Estimation
Understanding the groupings, trends, and characteristics of your customers
– Segmentation
Visualizing the Euclidean spatial relationships, trends, and patterns of your data
– Description
Stages of Data Mining Process

1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
Clusters of Galaxies

Skycat clustered 2×10⁹ sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.
Clustering: Examples

Cholera outbreak in London
Decision trees are an alternative way of structuring rule information.
[Decision tree for the weather data: outlook at the root; sunny → humidity (normal → P, high → N), overcast → P, rain → windy (true → N, false → P).]
Classification rules based on the tree

if outlook = overcast then P
if outlook = sunny & humidity = normal then P
if outlook = rain & windy = false then P
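These rules can be expressed directly as a short predicate. The sketch below is illustrative: the function name and the boolean encoding of windy are assumptions, not from the slides.

```python
def predict_play(outlook, humidity, windy):
    """Apply the three classification rules read off the decision tree.

    Returns "P" (play) when some rule fires, otherwise "N" (don't play).
    """
    if outlook == "overcast":
        return "P"
    if outlook == "sunny" and humidity == "normal":
        return "P"
    if outlook == "rain" and not windy:
        return "P"
    return "N"

print(predict_play("overcast", "high", True))  # P
print(predict_play("sunny", "normal", False))  # P
print(predict_play("sunny", "high", False))    # N
```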
[Decision tree: Outlook at the root with branches Sunny, Overcast, and Rain; the Sunny branch tests Humidity, whose High branch classifies No and whose Normal branch classifies Yes.]

Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
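One minimal way to realize this structure in code is with nested dicts: an internal node maps an attribute to its value branches, and a leaf is a class label. This is a sketch; the encoding and names are chosen for illustration.

```python
# A tree is either a leaf (a class label) or a dict
# {attribute: {value: subtree, ...}}.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    """Walk from the root: test each node's attribute and follow the
    branch matching the example's value until a leaf gives the class."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```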
Top-Down Induction of Decision Trees (ID3)

1. A ← the “best” decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant.
4. Sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
Which Attribute is ”best”?

A1 splits S = [29+,35-] into True → [21+, 5-] and False → [8+, 30-].
A2 splits S = [29+,35-] into True → [18+, 33-] and False → [11+, 2-].
Entropy

S is a sample of training examples.
p+ is the proportion of positive examples.
p- is the proportion of negative examples.
Entropy measures the impurity of S:

Entropy(S) = -p+ log2 p+ - p- log2 p-
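The definition translates directly into a few lines of Python; this sketch (function name illustrative) treats 0·log2 0 as 0, as the definition requires.

```python
from math import log2

def entropy(p_pos, p_neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5, 0.5))              # 1.0 -- maximally impure
print(entropy(1.0, 0.0))              # 0.0 -- a pure sample
print(round(entropy(9/14, 5/14), 3))  # 0.94 -- the [9+,5-] set used later
```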
Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).

Why? Information theory: an optimal-length code assigns -log2 p bits to a message having probability p. So the expected number of bits to encode the class (+ or -) of a random member of S is:

-p+ log2 p+ - p- log2 p-
Information Gain

• Gain(S,A): the expected reduction in entropy due to sorting S on attribute A.

Gain(S,A) = Entropy(S) - Σv∈Values(A) (|Sv|/|S|) Entropy(Sv)

Entropy([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99

(As before, A1 splits S = [29+,35-] into [21+,5-] and [8+,30-]; A2 splits it into [18+,33-] and [11+,2-].)
Information Gain

A1 splits S = [29+,35-] into True → [21+,5-] and False → [8+,30-]:
Entropy([21+,5-]) = 0.71, Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - (26/64)*Entropy([21+,5-]) - (38/64)*Entropy([8+,30-]) = 0.27

A2 splits S = [29+,35-] into True → [18+,33-] and False → [11+,2-]:
Entropy([18+,33-]) = 0.94, Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - (51/64)*Entropy([18+,33-]) - (13/64)*Entropy([11+,2-]) = 0.12
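The two candidate splits can be checked numerically. This sketch (helper names illustrative) works from raw positive/negative counts and reproduces the gains above.

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from raw counts; 0*log2(0) is treated as 0."""
    total = pos + neg
    return sum(-c/total * log2(c/total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Entropy(S) minus the size-weighted entropies of the branches."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q)/n * entropy(p, q) for p, q in splits)

# S = [29+,35-]; A1 splits it into [21+,5-] and [8+,30-],
# A2 splits it into [18+,33-] and [11+,2-].
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # 0.12
```

Since 0.27 > 0.12, ID3 would branch on A1 here.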
Training Examples

| Day | Outlook | Temp. | Humidity | Wind | Play Tennis |
|-----|----------|-------|----------|--------|-------------|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Weak | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Strong | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
Selecting the Next Attribute
Humidity splits S = [9+,5-] (E = 0.940) into High → [3+,4-] (E = 0.985) and Normal → [6+,1-] (E = 0.592).
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind splits S = [9+,5-] (E = 0.940) into Weak → [6+,2-] (E = 0.811) and Strong → [3+,3-] (E = 1.0).
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Outlook splits S = [9+,5-] (E = 0.940) into Sunny → [2+,3-] (E = 0.971), Overcast → [4+,0-] (E = 0.0), and Rain → [3+,2-] (E = 0.971).
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
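The three gains on the full PlayTennis set can be reproduced with the same count-based helpers (names illustrative). Note the slides round Gain(S,Humidity) to 0.151; the unrounded value is about 0.1518.

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-c/total * log2(c/total) for c in (pos, neg) if c)

def gain(parent, splits):
    n = sum(parent)
    return entropy(*parent) - sum((p + q)/n * entropy(p, q) for p, q in splits)

S = (9, 5)  # the 14 PlayTennis examples
print(round(gain(S, [(3, 4), (6, 1)]), 3))          # Humidity: 0.152 (slide: 0.151)
print(round(gain(S, [(6, 2), (3, 3)]), 3))          # Wind: 0.048
print(round(gain(S, [(2, 3), (4, 0), (3, 2)]), 3))  # Outlook: 0.247 -- the winner
```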
ID3 Algorithm

Outlook splits [D1,D2,…,D14] = [9+,5-]:
Sunny → Ssunny = [D1,D2,D8,D9,D11] = [2+,3-] → ?
Overcast → [D3,D7,D12,D13] = [4+,0-] → Yes
Rain → [D4,D5,D6,D10,D14] = [3+,2-] → ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
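The recursion on the Sunny branch can be checked the same way: the sketch below (helper names illustrative) recomputes the three Ssunny gains from the branch counts.

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-c/total * log2(c/total) for c in (pos, neg) if c)

def gain(parent, splits):
    n = sum(parent)
    return entropy(*parent) - sum((p + q)/n * entropy(p, q) for p, q in splits)

S_sunny = (2, 3)  # [D1,D2,D8,D9,D11]
print(round(gain(S_sunny, [(0, 3), (2, 0)]), 2))          # Humidity: 0.97
print(round(gain(S_sunny, [(0, 2), (1, 1), (1, 0)]), 2))  # Temp.: 0.57
print(round(gain(S_sunny, [(1, 2), (1, 1)]), 2))          # Wind: 0.02
```

Humidity wins, which is why the final tree tests Humidity under Sunny.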
[Final decision tree: Outlook at the root. Sunny → Humidity: High → No [D1,D2], Normal → Yes [D8,D9,D11]. Overcast → Yes [D3,D7,D12,D13]. Rain → Wind: Strong → No [D6,D14], Weak → Yes [D4,D5,D10].]
The ID3 Algorithm

Given
• a set of disjoint target classes {C1, C2, …, Ck},
• a set of training data, S, containing objects of more than one class.

Let T be any test on a single attribute of the data, with O1, O2, …, On representing the possible outcomes of applying T to any object x (written as T(x)).

T produces a partition {S1, S2, …, Sn} of S such that

Si = { x | T(x) = Oi }
Proceed recursively to replace each Si with a decision tree.

Crucial factor: selecting the tests.

[Diagram: the set S splits under outcomes O1, O2, …, On into subsets S1, S2, …, Sn.]
In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory).
M = {m1, m2, …, mn}: the set of messages
p(mi): the probability of message mi being received
I(mi) = -log p(mi): the amount of information in message mi
U(M) = Σi p(mi) I(mi): the uncertainty of the set M

Quinlan's assumptions:
• A correct decision tree for S will classify objects in the same proportion as their representation in S.
• Given a case to classify, a test can be regarded as the source of a message about that case.
Let Ni be the number of cases in S that belong to class Ci:

p(c ∈ Ci) = Ni / |S|

The uncertainty, U(S), measures the average amount of information needed to determine the class of a random case, c ∈ S.

Uncertainty measure after S has been partitioned:

UT(S) = Σi (|Si| / |S|) U(Si)

Select the test T that gains the most information, i.e., for which

GS(T) = U(S) - UT(S)

is maximal.
Evaluation of ID3

The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number. Its computational complexity depends on the cost of choosing the next test to branch on. It was adapted to deal with noisy and incomplete data, and it is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available. However, the method is not incremental.

Further modifications were introduced in C4.5, e.g.:
• pruning the decision tree in order to avoid overfitting
• a better test-selection heuristic
Search Space and Search Trees

• The search space is a logical space composed of
  – nodes, which are search states
  – links, which are all legal connections between search states
    • e.g., in chess there is no link between states where White castles having previously moved the King.
• The space is always just an abstraction.
• Think of search algorithms as trying to navigate this extremely complex space.
Search Trees

• Search trees do not summarise all possible searches; each is an abstraction of one possible search.
• The root is the null state.
• Edges represent one choice, e.g., setting the value of A first.
• Child nodes represent extensions; the children give all possible choices.
• Leaf nodes are solutions or failures.
• Example in SAT:
  – the algorithm detects failure early
  – it need not pick the same variables everywhere

[Figure: a SAT search tree starting from state = (). The root chooses A; subsequent nodes choose B or C. Several branches (e.g., aB, ab) are marked impossible, while the branch (AbC) is a solution.]
Definition

• A tree-shaped structure that represents a set of decisions. These decisions are used as a basis for predictions.
• Decision trees represent rules for classifying datasets; useful knowledge can be extracted by this classification.
DT Structure

• Node types
  – Decision nodes: specify some test to be carried out on a single attribute value. Each outcome is assigned to a branch that connects to a leaf node or another decision node.
  – Leaf nodes: indicate the classification of an example.
An Example
Growing a Decision Tree
Iterative Dichotomiser 3 (ID3) Algorithm

• Invented by J. Ross Quinlan in 1975.
• Based on a greedy search algorithm.
ID3 Cont.

• The goal is to create the best possible tree that works on all available data.
  – Each example strictly belongs to one class or the other.
  – We need to select the attribute that best classifies the examples (i.e., the attribute with the smallest entropy over all the examples).
  – The lower the entropy, the higher the Information Gain. We desire high IG.
Entropy

• A quantitative measurement of the homogeneity of a set of examples.
• It tells us how well an attribute separates the training examples according to their target classification.
Entropy cont.

• Given a set S with only positive or negative examples (the 2-class case):

Entropy(S) = -PP log2 PP - PN log2 PN

where PP is the proportion of positive examples and PN is the proportion of negative examples.
Entropy cont.

• Example: given 25 examples with 15 positive and 10 negative,

Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.97

If Entropy(S) = 0, all members of S belong to strictly one class.
If Entropy(S) = 1 (the maximum value), members are split equally between the two classes.
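The worked example and both boundary cases can be checked with a short count-based helper (name illustrative):

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from example counts; 0*log2(0) is treated as 0."""
    total = pos + neg
    return sum(-c/total * log2(c/total) for c in (pos, neg) if c)

print(round(entropy(15, 10), 2))  # 0.97 -- the 25-example case above
print(entropy(25, 0))             # 0.0  -- all one class
print(entropy(10, 10))            # 1.0  -- an even split
```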
In General…

• In general, if an attribute takes more than two values,

Entropy(S) = -Σi=1..n pi log(pi)

where n is the number of values and pi is the proportion of examples taking the i-th value.
Information Gain

Gain(S,A) = Entropy(S) - Σv∈Values(A) (|Sv|/|S|) Entropy(Sv)

where A is an attribute of S, Values(A) is the set of possible values of A, v is a particular value in Values(A), and Sv is the subset of S whose value for A is v.
Actual Algorithm

ID3(Examples, Target_Attribute, Attributes)
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root with label = the most common value of the target attribute in the examples.
  Otherwise begin:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty:
        then below this new branch add a leaf node with label = the most common target value in the examples
      else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  End
  Return Root
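The pseudocode can be sketched in a few dozen lines of Python. This is an illustrative implementation, not the original: it represents examples as dicts, handles any class labels (not just +/-), and picks the best attribute by information gain.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the target-attribute distribution over `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return sum(-c/total * log2(c/total) for c in counts.values())

def gain(examples, target, attribute):
    """Information gain of splitting `examples` on `attribute`."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex[attribute], []).append(ex)
    remainder = sum(len(g)/len(examples) * entropy(g, target)
                    for g in groups.values())
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Sketch of the pseudocode above: leaves are class labels,
    internal nodes are (attribute, {value: subtree}) pairs."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # all one class
        return labels[0]
    if not attributes:                        # no predictors left
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, target, a))
    branches = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, target,
                              [a for a in attributes if a != best])
    return (best, branches)

# Tiny illustrative run (data made up for the demo):
toy = [{"Outlook": "overcast", "Play": "Yes"},
       {"Outlook": "sunny",    "Play": "No"},
       {"Outlook": "rain",     "Play": "Yes"}]
root, branches = id3(toy, "Play", ["Outlook"])
print(root)               # Outlook
print(branches["sunny"])  # No
```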
The 14 training examples (independent attributes: Outlook, Temperature, Humidity, Windy; dependent attribute: Play):

| Outlook | Temperature | Humidity | Windy | Play |
|----------|-------------|----------|-------|------------|
| sunny | 85 | 85 | FALSE | Don't play |
| sunny | 80 | 90 | TRUE | Don't play |
| overcast | 83 | 78 | FALSE | Play |
| rain | 70 | 95 | FALSE | Play |
| rain | 68 | 80 | FALSE | Play |
| rain | 65 | 70 | TRUE | Don't play |
| overcast | 64 | 65 | TRUE | Play |
| sunny | 72 | 95 | FALSE | Don't play |
| sunny | 69 | 70 | FALSE | Play |
| rain | 75 | 80 | FALSE | Play |
| sunny | 75 | 70 | TRUE | Play |
| overcast | 72 | 90 | TRUE | Play |
| overcast | 81 | 75 | FALSE | Play |
| rain | 71 | 80 | TRUE | Don't play |

We choose Play as our dependent attribute. Don't play = Negative; Play = Positive.
Outlook playdon’t play total Entropy
sunny 2 3 5 0.34676807
ovecast 4 0 4 0
rain 3 2 5 0.34676807
total 0.69353614
Temp
>70 5 4 9 0.63712032
<=70 4 1 5 0.25783146
total 0.89495179
Humidity
>70 6 4 10 0.69353614
<=70 3 1 4 0.23179375
total 0.92532989
Windy
TRUE 3 3 6 0.42857143
FALSE 6 2 8 0.4635875
total 0.89215893
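A few of these cells can be recomputed directly; the sketch below (helper name illustrative) evaluates (|Sv|/|S|)·Entropy(Sv) from a branch's positive/negative counts over the 14 examples.

```python
from math import log2

def weighted_entropy(pos, neg, total_examples):
    """One table cell: (|Sv|/|S|) * Entropy(Sv) for a branch with
    `pos` positive and `neg` negative cases out of `total_examples`."""
    size = pos + neg
    ent = sum(-c/size * log2(c/size) for c in (pos, neg) if c)
    return size/total_examples * ent

print(round(weighted_entropy(2, 3, 14), 8))  # Outlook=sunny: 0.34676807
print(weighted_entropy(4, 0, 14))            # Outlook=overcast: 0.0
print(round(weighted_entropy(3, 3, 14), 8))  # Windy=TRUE: 0.42857143
```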
The sunny subset (dependent attribute: Play):

| Outlook | Temperature | Humidity | Windy | Play |
|---------|-------------|----------|-------|------------|
| sunny | 85 | 85 | FALSE | Don't play |
| sunny | 80 | 90 | TRUE | Don't play |
| sunny | 72 | 95 | FALSE | Don't play |
| sunny | 69 | 70 | FALSE | Play |
| sunny | 75 | 70 | TRUE | Play |
play don’t play total Entropy
Temp
>70 1 3 4 0.6490225
<=70 1 0 1 0
total 0.6490225
Humidity
>70 0 3 3 0
<=70 2 0 2 0
total 0
Windy
TRUE 1 1 2 0.4
FALSE 1 2 3 0.5509775
total 0.9509775
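On the sunny subset, Humidity's weighted-entropy total is 0, so it is the attribute chosen. A sketch (helper name illustrative) reproducing the three totals from the branch counts:

```python
from math import log2

def branch_total(branches, total):
    """Sum of (|Sv|/|S|) * Entropy(Sv) over an attribute's branches;
    `branches` is a list of (pos, neg) counts and `total` = |S|."""
    result = 0.0
    for pos, neg in branches:
        size = pos + neg
        result += size/total * sum(-c/size * log2(c/size)
                                   for c in (pos, neg) if c)
    return result

# Sunny subset (5 examples): pick the attribute with the smallest total.
print(round(branch_total([(1, 3), (1, 0)], 5), 3))  # Temp: 0.649
print(round(branch_total([(0, 3), (2, 0)], 5), 3))  # Humidity: 0.0
print(round(branch_total([(1, 1), (1, 2)], 5), 3))  # Windy: 0.951
```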
The rain subset (dependent attribute: Play):

| Outlook | Temperature | Humidity | Windy | Play |
|---------|-------------|----------|-------|------------|
| rain | 70 | 95 | FALSE | Play |
| rain | 68 | 80 | FALSE | Play |
| rain | 65 | 70 | TRUE | Don't play |
| rain | 75 | 80 | FALSE | Play |
| rain | 71 | 80 | TRUE | Don't play |
play don’t play total Entropy
Temp
>70 1 1 2 0.4
<=70 2 1 3 0.5509775
total 0.9509775
Humidity
>70 1 0 3 0
<=70 3 1 4 0.6490225
total 0.6490225
Windy
TRUE 0 2 2 0
FALSE 3 0 3 0
total 0
The overcast subset (dependent attribute: Play):

| Outlook | Temperature | Humidity | Windy | Play |
|----------|-------------|----------|-------|------|
| overcast | 83 | 78 | FALSE | Play |
| overcast | 64 | 65 | TRUE | Play |
| overcast | 72 | 90 | TRUE | Play |
| overcast | 81 | 75 | FALSE | Play |
play don’t play total Entropy
Temp
>70 2 0 2 0
<=70 0 2 2 0
total 0
Humidity
>70 3 0 3 0
<=70 1 0 1 0
total 0
Windy
TRUE 2 0 2 0
FALSE 2 0 2 0
total 0
ID3 Summary

• Step 1: Take all unused attributes and compute their entropy with respect to the test samples.
• Step 2: Choose the attribute with the smallest entropy.
• Step 3: Make a node containing that attribute.
Growth stops when:

• every attribute already exists along the path through the tree, or
• the training examples associated with a leaf all have the same target attribute value.
References

• An Overview of Data Mining Techniques: http://www.thearling.com/text/dmtechniques/dmtechniques.htm
• Decision tree: http://en.wikipedia.org/wiki/Decision_tree
• Decision Trees: http://dms.irb.hr/tutorial/tut_dtrees.php