Page 1:

MULTI-INTERVAL DISCRETIZATION OF CONTINUOUS VALUED ATTRIBUTES FOR CLASSIFICATION LEARNING

KIRANKUMAR K. TAMBALKAR

Page 2:

What is Discretization?

Discretization is the process of transforming continuous functions, models, and equations into discrete values.

This is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers.

Page 3:

Why Discretization?

The main aim is to reduce the many values of a continuous attribute to a small number of discrete values.

Typically, the data is discretized into K partitions of equal length/width (equal intervals).
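As an illustration (my own minimal Python sketch, not from the slides), equal-width binning assigns each value to one of K same-width intervals:

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bins 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum value lands in the last bin instead of bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]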

Page 4:

Discretization

This talk covers the discretization of continuous-valued attributes. It first presents the result about information entropy minimization and the heuristic for binary discretization (two-interval splits), then develops a better understanding of the heuristic and its behavior, and finally gives formal evidence that supports the usage of the heuristic in this context.

Page 5:

Binary Discretization

A continuous-valued attribute is typically discretized during decision tree generation by partitioning its range into two intervals.

A threshold value T for a continuous attribute A is determined, and the test A <= T is assigned to the left branch while A > T is assigned to the right branch.

We call such a threshold value T a cut point.

Page 6:

What is Entropy

Entropy, also called expected information entropy, is a value that describes how consistently a potential split separates the class labels.

Ex: Suppose we look at the group below age 25. Out of that group, how many people can we expect to have an income above 50K versus below 50K?

Lower entropy is better, and an entropy of 0 is the best possible.
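For instance (illustrative numbers, not from the slides): if 8 of 10 people in the under-25 group earn above 50K and 2 earn below, the entropy is -0.8 log2(0.8) - 0.2 log2(0.2) ≈ 0.72 bits, while a group that is entirely one class has entropy 0.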

Page 7:

Data set example

A generic data set with two features and a class label for each example:

Features (f1)   Features (f2)   Class Label
a1              b1              (class label)
a2              b2              (class label)
a3              b3              (class label)
a4              b4              (class label)
a5              b5              (class label)
a6              b6              (class label)
a7              b7              (class label)
a8              b8              (class label)
a9              b9              (class label)

Page 8:

Algorithm

Binary Discretization: We select an attribute for branching at a node holding a set S of N examples. For each continuous-valued attribute A, we select the "best" cut point TA from its range of values by evaluation.

First, we sort the given data into increasing order of attribute A. The midpoint between each successive pair of examples in the sorted sequence is then evaluated as a potential cut point.

Thus, for each continuous-valued attribute, N-1 evaluations take place. For each evaluation of a candidate cut point T, the data is partitioned into two sets, and the class entropy of the resulting partition is computed.
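A minimal Python sketch of this step (the function name candidate_cut_points is mine, not the paper's):

def candidate_cut_points(values):
    """Midpoints between successive distinct values of the sorted attribute."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

print(candidate_cut_points([8, 9, 7, 3, 2, 5, 1, 6, 4, 10]))
# [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5] -- N-1 = 9 candidates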

Page 9:

Example

[Figure: ten example attribute values (8, 9, 7, 3, 2, 5, 1, 6, 4, 10) shown before and after sorting into increasing order (1 through 10); candidate cut points lie at the midpoints between successive values.]

Page 10:

Example

Page 11:

Algorithm

Binary Discretization: Let T partition the set S of examples into the subsets S1 and S2. Let there be k classes C1, ..., Ck, and let P(Ci, S) be the proportion of examples in S that have class Ci. The class entropy of a subset S is defined as:

Ent(S) = - Σ_{i=1..k} P(Ci, S) log2 P(Ci, S)
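This definition translates directly into Python (a sketch; labels stands for the list of class labels of S):

from collections import Counter
from math import log2

def ent(labels):
    """Class entropy: Ent(S) = - sum over classes of P(Ci, S) * log2 P(Ci, S)."""
    n = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n          # P(Ci, S): the proportion of class Ci in S
        total -= p * log2(p)
    return total

print(ent(["+", "+", "-", "-"]))  # 1.0 bit: a maximally mixed two-class set
print(ent(["+", "+", "+", "+"]))  # 0.0: a pure set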

Page 12:

Algorithm

Class Entropy

Page 13:

Algorithm

Binary Discretization: When the logarithm base is 2, Ent(S) measures the amount of information needed, in bits, to specify the classes in S.

We then evaluate the resulting class entropy after the set S is partitioned into two sets S1 and S2.

Page 14:

Algorithm

Example: For an example set S, an attribute A, and a cut point value T, let S1 ⊆ S be the subset of examples in S with A-values <= T, and let S2 = S - S1. The class information entropy of the partition induced by T, written E(A, T; S), is defined as:

E(A, T; S) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
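In Python this weighted average is one short function (a sketch reusing ent() from above; pairing attribute values with labels is my framing):

def partition_entropy(values, labels, t):
    """E(A, T; S): the |Si|/|S|-weighted class entropy of the two subsets."""
    s1 = [c for v, c in zip(values, labels) if v <= t]
    s2 = [c for v, c in zip(values, labels) if v > t]
    n = len(labels)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)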

Page 15:

Algorithm

Example

Page 16:

Algorithm

Binary Discretization: A binary discretization for A is determined by selecting the cut point TA for which E(A, TA; S) is minimal among all the candidate cut points.
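Combining the sketches above, the selection step is a minimum over the N-1 candidates:

def best_cut_point(values, labels):
    """Return (TA, E(A, TA; S)) minimizing the partition entropy."""
    candidates = candidate_cut_points(values)
    return min(((t, partition_entropy(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Toy data: the class flips between low and high attribute values.
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labs = ["-", "-", "-", "-", "-", "+", "+", "+", "+", "+"]
print(best_cut_point(vals, labs))  # (5.5, 0.0): a perfect split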

Page 17:

Gain of the entropy

Once we find the cut point with minimum entropy among all the candidates, we compute the gain in entropy.

How do we compute the entropy gain?

Page 18:

Gain of the entropy

Gain(A, T; S) = Ent(S) - E(A, T; S)

Page 19:

Gain of the entropy

Page 20:

MDLPC Criterion

The Minimum Description Length Principle: Having computed the entropy gain, we are ready to state our decision criterion for accepting or rejecting a given partition, based on the MDLP.

Page 21:

MDLPC Criterion

The partition induced by a cut point T for a set S of N examples is accepted if and only if

Gain(A, T; S) > log2(N - 1) / N + Δ(A, T; S) / N

where Δ(A, T; S) = log2(3^k - 2) - [k Ent(S) - k1 Ent(S1) - k2 Ent(S2)], and ki is the number of classes present in Si.

If the partition is accepted, the discretization goes through and we assign a discrete value to each resulting interval of the data set.

If the partition is rejected, the selected cut point is discarded and we search for cut points again in the given example data set.
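The test is mechanical once Gain and Δ are in hand; a sketch reusing ent(), partition_entropy(), and log2 from above (the function name is mine):

def mdlpc_accepts(values, labels, t):
    """Accept cut point t iff Gain(A, T; S) > (log2(N - 1) + delta) / N."""
    n = len(labels)
    s1 = [c for v, c in zip(values, labels) if v <= t]
    s2 = [c for v, c in zip(values, labels) if v > t]
    k, k1, k2 = len(set(labels)), len(set(s1)), len(set(s2))
    gain = ent(labels) - partition_entropy(values, labels, t)
    delta = log2(3 ** k - 2) - (k * ent(labels) - k1 * ent(s1) - k2 * ent(s2))
    return gain > (log2(n - 1) + delta) / n

print(mdlpc_accepts(vals, labs, 5.5))  # True: the perfect split from above is kept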

Page 22:

Empirical Evaluation

We compare four different decision strategies for deciding whether or not to accept a partition. The following variations of the algorithm are compared (a code sketch follows the list):

Never Cut: The original binary interval algorithm

Always Cut: Always accept a cut unless all examples have the same class or the same value for the attribute.

Random Cut: Accepts or rejects a cut by flipping a fair coin.

MDLP cut: The derived MDLPC criterion.
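Seen as code, the four strategies are interchangeable accept/reject policies (a sketch reusing mdlpc_accepts() from above; random.random provides the coin flip):

import random

def accepts(strategy, values, labels, t):
    """Decide whether to keep the partition induced by cut point t."""
    if strategy == "never":    # original binary-interval algorithm: no further cuts
        return False
    if strategy == "always":   # cut unless the set is pure or constant-valued
        return len(set(labels)) > 1 and len(set(values)) > 1
    if strategy == "random":   # fair coin flip
        return random.random() < 0.5
    if strategy == "mdlp":     # the derived MDLPC criterion
        return mdlpc_accepts(values, labels, t)
    raise ValueError(strategy)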

Page 23:

Results

Page 24:

Thank you