
6

Discretization Methods

Ying Yang1, Geoffrey I. Webb2, and Xindong Wu3

1 School of Computer Science and Software Engineering, Monash University, Melbourne, Australia
  [email protected]

2 Faculty of Information Technology, Monash University, Australia
  [email protected]

3 Department of Computer Science, University of Vermont, USA
  [email protected]

Summary. Data-mining applications often involve quantitative data. However, learning from

quantitative data is often less effective and less efficient than learning from qualitative data.

Discretization addresses this issue by transforming quantitative data into qualitative data. This

chapter presents a comprehensive introduction to discretization. It clarifies the definition of

discretization. It provides a taxonomy of discretization methods together with a survey of

major discretization methods. It also discusses issues that affect the design and application of

discretization methods.

Key words: Discretization, quantitative data, qualitative data.

Introduction

Discretization is a data-processing procedure that transforms quantitative data into

qualitative data.

Data Mining applications often involve quantitative data. However, there exist many learning algorithms that are primarily oriented to handle qualitative data (Kerber, 1992, Dougherty et al., 1995, Kohavi and Sahami, 1996). Even for algorithms that can directly deal with quantitative data, learning is often less efficient and less effective (Catlett, 1991, Kerber, 1992, Richeldi and Rossotto, 1995, Frank and Witten, 1999). Hence discretization has long been an active topic in Data Mining and knowledge discovery. Many discretization algorithms have been proposed. Evaluation of these algorithms has frequently shown that discretization helps improve the performance of learning and helps understand the learning results.

This chapter presents an overview of discretization. Section 6.1 explains the terminology involved in discretization. It clarifies the definition of discretization, which has been defined in many differing ways in previous literature. Section 6.2 presents a comprehensive taxonomy of discretization approaches. Section 6.3 introduces typical discretization algorithms corresponding to the taxonomy. Section 6.4 addresses the issue that different discretization strategies are appropriate for different learning problems; hence the design or application of discretization should not be blind to its learning context. Section 6.5 provides a summary of this chapter.

6.1 Terminology

Discretization transforms one type of data to another type. In the large amount of existing literature that addresses discretization, there is considerable variation in the terminology used to describe these two data types, including ‘quantitative’ vs. ‘qualitative’, ‘continuous’ vs. ‘discrete’, ‘ordinal’ vs. ‘nominal’, and ‘numeric’ vs. ‘categorical’. It is necessary to make clear the difference among the various terms and accordingly choose the most suitable terminology for discretization.

We adopt the terminology of statistics (Bluman, 1992, Samuels and Witmer,

1999), which provides two parallel ways to classify data into different types. Data

can be classified into either qualitative or quantitative. Data can also be classified

into different levels of measurement scales. Sections 6.1.1 and 6.1.2 summarize this

terminology.

6.1.1 Qualitative vs. quantitative

Qualitative data, also often referred to as categorical data, are data that can be placed into distinct categories. Qualitative data sometimes can be arrayed in a meaningful order. But no arithmetic operations can be applied to them. Examples of qualitative data are: blood type of a person: A, B, AB, O; and assignment evaluation: fail, pass, good, excellent.

Quantitative data are numeric in nature. They can be ranked in order. They also

admit to meaningful arithmetic operations. Quantitative data can be further classified

into two groups, discrete or continuous.

Discrete data assume values that can be counted. The data cannot assume all values on the number line within their value range. An example is: number of children in a family.

Continuous data can assume all values on the number line within their value

range. The values are obtained by measuring. An example is: temperature.

6.1.2 Levels of measurement scales

In addition to being classified into either qualitative or quantitative, data can also be classified by how they are categorized, counted or measured. This type of classification uses measurement scales, and four common levels of scales are: nominal, ordinal, interval and ratio.


The nominal level of measurement scales classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no meaningful order or ranking can be imposed on the data. An example is: blood type of a person: A, B, AB, O.

The ordinal level of measurement scales classifies data into categories that can be ranked. However, the differences between the ranks cannot be calculated by arithmetic. An example is: assignment evaluation: fail, pass, good, excellent. It is meaningful to say that the assignment evaluation of pass ranks higher than that of fail. It is not meaningful in the same way to say that the blood type of A ranks higher than that of B.

The interval level of measurement scales ranks data, and the differences between units of measure can be calculated by arithmetic. However, zero in the interval level of measurement does not mean ‘nil’ or ‘nothing’ as zero in arithmetic means. An example is: Fahrenheit temperature. It has a meaningful difference of one degree between each unit, but 0 degrees Fahrenheit does not mean there is no heat. It is meaningful to say that 74 degrees is two degrees higher than 72 degrees. It is not meaningful in the same way to say that the evaluation of excellent is two degrees higher than the evaluation of good.

The ratio level of measurement scales possesses all the characteristics of interval measurement, and there exists a zero that, the same as arithmetic zero, means ‘nil’ or ‘nothing’. In consequence, true ratios exist between different units of measure. An example is: number of children in a family. It is meaningful to say that family X has twice as many children as does family Y. It is not meaningful in the same way to say that 100 degrees Fahrenheit is twice as hot as 50 degrees Fahrenheit.

The nominal level is the lowest level of measurement scales. It is the least powerful in that it provides the least information about the data. The ordinal level is higher, followed by the interval level. The ratio level is the highest. Any data conversion from a higher level of measurement scales to a lower level of measurement scales will lose information. Table 6.1 gives a summary of the characteristics of different levels of measurement scales.

Table 6.1. Measurement Scales

Level     Ranking?  Arithmetic operation?  Arithmetic zero?
Nominal   no        no                     no
Ordinal   yes       no                     no
Interval  yes       yes                    no
Ratio     yes       yes                    yes

6.1.3 Summary

In summary, the following classification of data types applies:

1. qualitative data:

a) nominal;

b) ordinal;


2. quantitative data:

a) interval, either discrete or continuous;

b) ratio, either discrete or continuous.

We believe that ‘discretization’ as it is usually applied in data mining is best defined as the transformation from quantitative data to qualitative data. In consequence, we will refer to data as either quantitative or qualitative throughout this chapter.

6.2 Taxonomy

There exist diverse taxonomies in the existing literature to classify discretization

methods. Different taxonomies emphasize different aspects of the distinctions among

discretization methods.

Typically, discretization methods can be either primary or composite. Primary

methods accomplish discretization without reference to any other discretization

method. Composite methods are built on top of some primary method(s).

Primary methods can be classified as per the following taxonomies.

1. Supervised vs. Unsupervised (Dougherty et al., 1995). Methods that use the class information of the training instances to select discretization cut points are supervised. Methods that do not use the class information are unsupervised. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based according to whether intervals are selected using metrics based on error on the training data, entropy of the intervals, or some statistical measure.

2. Parametric vs. Non-parametric. Parametric discretization requires input from

the user, such as the maximum number of discretized intervals. Non-parametric

discretization only uses information from data and does not need input from the

user.

3. Hierarchical vs. Non-hierarchical. Hierarchical discretization selects cut points in an incremental process, forming an implicit hierarchy over the value range. The procedure can be split or merge (Kerber, 1992). Split discretization initially has the whole value range as an interval, then continues splitting it into sub-intervals until some threshold is met. Merge discretization initially puts each value into an interval, then continues merging adjacent intervals until some threshold is met. Some discretization methods utilize both split and merge processes. For example, intervals are initially formed by splitting, and then a merge process is performed to post-process the formed intervals. Non-hierarchical discretization does not form any hierarchy during discretization. For example, many methods scan the ordered values only once, sequentially forming the intervals.

4. Univariate vs. Multivariate (Bay, 2000). Methods that discretize each attribute

in isolation are univariate. Methods that take into consideration relationships

among attributes during discretization are multivariate.

5. Disjoint vs. Non-disjoint (Yang and Webb, 2002). Disjoint methods discretize the value range of the attribute under discretization into disjoint intervals. No intervals overlap. Non-disjoint methods discretize the value range into intervals that can overlap.

6. Global vs. Local (Dougherty et al., 1995). Global methods discretize with respect to the whole training data space. They perform discretization once only, using a single set of intervals throughout a single classification task. Local methods allow different sets of intervals to be formed for a single attribute, each set being applied in a different classification context. For example, different discretizations of a single attribute might be applied at different nodes of a decision tree (Quinlan, 1993).

7. Eager vs. Lazy (Hsu et al., 2000, Hsu et al., 2003). Eager methods perform discretization prior to classification time. Lazy methods perform discretization during classification time.

8. Time-sensitive vs. Time-insensitive. Under time-sensitive discretization, the qualitative value associated with a quantitative value can change over time. That is, the same quantitative value can be discretized into different values depending on the previous values observed in the time series. Time-insensitive discretization only uses the stationary properties of the quantitative data.

9. Ordinal vs. Nominal. Ordinal discretization transforms quantitative data into ordinal qualitative data. It aims at taking advantage of the ordering information implicit in quantitative attributes, so as not to make values 1 and 2 as dissimilar as values 1 and 10. Nominal discretization transforms quantitative data into nominal qualitative data. The ordering information is hence discarded.

10. Fuzzy vs. Non-fuzzy (Wu, 1995, Wu, 1999, Ishibuchi et al., 2001). Fuzzy discretization first discretizes quantitative attribute values into intervals. It then places some kind of membership function at each cut point as fuzzy borders. The membership function measures the degree of each value belonging to each interval. With these fuzzy borders, a value can be discretized into a few different intervals at the same time, with varying degrees. Non-fuzzy discretization forms sharp borders without employing any membership function.

Composite methods first choose some primary discretization method to form the

initial cut points. They then focus on how to adjust these initial cut points to achieve

certain goals. The taxonomy of a composite method sometimes is flexible, depending

on the taxonomy of its primary method.

6.3 Typical methods

Corresponding to our taxonomy in the previous section, we here enumerate some

typical discretization methods. There are many other methods that are not reviewed

due to the space limit. For a more comprehensive study on existing discretization

algorithms, Yang (2003) and Wu (1995) offer good sources.


6.3.1 Background and terminology

A term often used for describing a discretization approach is ‘cut point’. Discretization forms intervals according to the value range of the quantitative data. It then associates a qualitative value to each interval. A cut point is a value among the quantitative data where an interval boundary is located by a discretization method. Another commonly-mentioned term is ‘boundary cut point’, which refers to a value between two instances with different classes in the sequence of instances sorted by a quantitative attribute. It has been proved that evaluating only the boundary cut points is sufficient for finding the minimum class information entropy (Fayyad and Irani, 1993).
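As an illustration of this terminology, the following sketch (not taken from the chapter; the function name and its simplified treatment of identical values are ours) collects the boundary cut points of one quantitative attribute as midpoints between adjacent distinct sorted values whose instances carry different classes:

# Hypothetical helper: boundary cut points for one quantitative attribute.
def boundary_cut_points(values, classes):
    """values: numbers; classes: class labels of the same instances."""
    pairs = sorted(zip(values, classes))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if v1 != v2 and c1 != c2:         # class changes across two distinct values
            cuts.append((v1 + v2) / 2.0)  # midpoint is a boundary cut point candidate
    return cuts

print(boundary_cut_points([1, 2, 3, 8, 9], ['a', 'a', 'b', 'b', 'a']))  # [2.5, 8.5]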

We use the following terminology. Data comprises a set or sequence of instances. Each instance is described by a vector of attribute values. For classification learning, each instance is also labelled with a class. Each attribute is either qualitative or quantitative. Classes are qualitative. Instances from which one learns cut points or other knowledge are training instances. If a test instance is presented, a learning algorithm is asked to make a prediction about the test instance according to the evidence provided by the training instances.

6.3.2 Equal-width, equal-frequency and fixed-frequency discretization

We present these three methods together because they are seemingly similar but actually different. All three are typical of unsupervised discretization. They are also typical of parametric discretization.

When discretizing a quantitative attribute, equal width discretization (EWD) (Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of intervals. It then divides the number line between vmin and vmax into k intervals of equal width, where vmin is the minimum observed value and vmax is the maximum observed value. Thus the intervals have width w = (vmax − vmin)/k and the cut points are at vmin + w, vmin + 2w, ..., vmin + (k−1)w.

When discretizing a quantitative attribute, equal-frequency discretization (EFD) (Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of intervals. It then divides the sorted values into k intervals so that each interval contains approximately the same number of training instances. Suppose there are n training instances; each interval then contains n/k training instances with adjacent (possibly identical) values. Note that training instances with identical values must be placed in the same interval. In consequence it is not always possible to generate k equal-frequency intervals.

When discretizing a quantitative attribute, fixed-frequency discretization (FFD) (Yang and Webb, 2004) predefines a sufficient interval frequency k. Then it discretizes the sorted values into intervals so that each interval has approximately⁴ the same number k of training instances with adjacent (possibly identical) values.

It is worthwhile contrasting EFD and FFD, both of which form intervals of equal frequency. EFD fixes the interval number, which is usually arbitrarily chosen. FFD fixes the interval frequency, which is not arbitrary but is chosen to ensure that each interval contains sufficient instances to supply information, such as for estimating probability.

4 Just as for EFD, because of the existence of identical values, some intervals can have instance frequency exceeding k.
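The three methods can be contrasted with a short sketch. The code below is illustrative only (the function names are ours and the handling of identical values is simplified relative to the published methods): EWD fixes the interval width, EFD fixes the interval number, and FFD fixes the interval frequency.

def equal_width_cuts(values, k):
    # EWD: k intervals of equal width between the observed minimum and maximum.
    vmin, vmax = min(values), max(values)
    w = (vmax - vmin) / k
    return [vmin + i * w for i in range(1, k)]

def equal_frequency_cuts(values, k):
    # EFD: k intervals holding roughly n/k instances each; identical values can
    # force some intervals to differ in frequency.
    v, n = sorted(values), len(values)
    cuts = []
    for i in range(1, k):
        j = i * n // k
        if v[j] != v[j - 1]:
            cuts.append((v[j - 1] + v[j]) / 2.0)
    return cuts

def fixed_frequency_cuts(values, freq):
    # FFD: as many intervals as needed so that each holds roughly `freq` instances.
    v = sorted(values)
    cuts = []
    for j in range(freq, len(v), freq):
        if v[j] != v[j - 1]:
            cuts.append((v[j - 1] + v[j]) / 2.0)
    return cuts

data = [1, 3, 4, 4, 7, 9, 12, 15, 18, 20]
print(equal_width_cuts(data, 4))      # width (20-1)/4: [5.75, 10.5, 15.25]
print(equal_frequency_cuts(data, 4))  # roughly 10/4 instances per interval
print(fixed_frequency_cuts(data, 3))  # roughly 3 instances per interval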

6.3.3 Multi-interval-entropy-minimization discretization (MIEMD)

Multi-interval-entropy-minimization discretization (Fayyad and Irani, 1993) is typical of supervised discretization. It is also typical of non-parametric discretization. To discretize an attribute, MIEMD evaluates as a candidate cut point the midpoint between each successive pair of the sorted values. For evaluating each candidate cut point, the data are discretized into two intervals and the resulting class information entropy is calculated. A binary discretization is determined by selecting the cut point for which the entropy is minimal amongst all candidates. The binary discretization is applied recursively, always selecting the best cut point. A minimum description length criterion (MDL) is applied to decide when to stop discretization.
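A minimal sketch of the core step is given below; it only finds the single best binary cut by class information entropy, while the full method applies this recursively and uses the MDL stopping criterion, which is omitted here. Names are ours, not from the cited work.

from collections import Counter
from math import log2

def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_binary_cut(values, classes):
    # Evaluate the midpoint between each successive pair of distinct sorted values
    # and return (cut point, weighted class information entropy) of the best split.
    pairs = sorted(zip(values, classes))
    vs = [v for v, _ in pairs]
    cs = [c for _, c in pairs]
    n, best = len(pairs), None
    for i in range(1, n):
        if vs[i] == vs[i - 1]:
            continue
        cut = (vs[i - 1] + vs[i]) / 2.0
        e = (i / n) * class_entropy(cs[:i]) + ((n - i) / n) * class_entropy(cs[i:])
        if best is None or e < best[1]:
            best = (cut, e)
    return best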

6.3.4 ChiMerge, StatDisc and InfoMerge discretization

EWD and EFD are non-hierarchical discretization. MIEMD involves a split procedure and hence is hierarchical discretization. A typical merge approach to hierarchical discretization is ChiMerge (Kerber, 1992). It uses the χ2 (Chi square) statistic to determine if the relative class frequencies of adjacent intervals are distinctly different or if they are similar enough to justify merging them into a single interval. The ChiMerge algorithm consists of an initialization process and a bottom-up merging process. The initialization process contains two steps: (1) ascendingly sort the training instances according to their values for the attribute being discretized, (2) construct the initial discretization, in which each instance is put into its own interval. The interval merging process contains two steps, repeated continuously: (1) compute the χ2 for each pair of adjacent intervals, (2) merge the pair of adjacent intervals with the lowest χ2 value. Merging continues until all pairs of intervals have χ2 values exceeding a predefined χ2-threshold. That is, all intervals are considered significantly different by the χ2 independence test. The recommended χ2-threshold is at the 0.90, 0.95 or 0.99 significance level.
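The sketch below illustrates the merging loop under simplified, assumed conditions (our own function names; class counts per interval are passed as dicts; Kerber's correction for very small expected counts is ignored):

from collections import Counter

def chi2_adjacent(count_a, count_b):
    # Chi-square statistic for a pair of adjacent intervals, each given as a
    # dict mapping class label -> frequency.
    classes = set(count_a) | set(count_b)
    total = sum(count_a.values()) + sum(count_b.values())
    chi2 = 0.0
    for counts in (count_a, count_b):
        row_total = sum(counts.values())
        for c in classes:
            col_total = count_a.get(c, 0) + count_b.get(c, 0)
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (counts.get(c, 0) - expected) ** 2 / expected
    return chi2

def chimerge(interval_counts, threshold):
    # Repeatedly merge the adjacent pair with the lowest chi-square value until
    # every adjacent pair exceeds the chosen threshold.
    intervals = [dict(c) for c in interval_counts]   # in value order
    while len(intervals) > 1:
        scores = [chi2_adjacent(a, b) for a, b in zip(intervals, intervals[1:])]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break
        intervals[i:i + 2] = [dict(Counter(intervals[i]) + Counter(intervals[i + 1]))]
    return intervals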

StatDisc discretization (Richeldi and Rossotto, 1995) extends ChiMerge to allow any number of intervals to be merged instead of only 2 as ChiMerge does. Both ChiMerge and StatDisc are based on a statistical measure of dependency. The statistical measures treat an attribute and a class symmetrically. A third merge discretization, InfoMerge (Freitas and Lavington, 1996), argues that an attribute and a class should be treated asymmetrically, since one wants to predict the value of the class attribute given the discretized attribute but not the reverse. Hence InfoMerge uses information loss, calculated as the difference between the amount of information necessary to identify the class of an instance after merging and the amount before merging, to direct the merge procedure.


6.3.5 Cluster-based discretization

The above mentioned methods are all univariate. A typical multivariate discretization

technique is cluster-based discretization (Chmielewski and Grzymala-Busse, 1996).

This method consists of two steps. The first step is cluster formation to determine

initial intervals for the quantitative attributes. The second step is post-processing to

minimize the number of discretized intervals. Instances here are deemed as points in

n-dimensional space which is defined by n attribute values. During cluster formation,

the median cluster analysis method is used. Clusters are initialized by allowing each

instance to be a cluster. New clusters are formed by merging two existing clusters that

exhibit the greatest similarity between each other. The cluster formation continues as

long as the level of consistency of the partition is not less than the level of consistency

of the original data. Once this process is completed, instances that belong to the same

cluster are indiscernible by the subset of quantitative attributes, thus a partition on

the set of training instances is induced. Clusters can be analyzed in terms of all

attributes to find out cut points for each attribute simultaneously. After discretized

intervals are formed, post-processing picks for merging the pair of adjacent intervals, among all quantitative attributes, whose resulting class entropy is the smallest. If the consistency of the dataset after the merge is above a given threshold, the merge is performed. Otherwise this pair of intervals is marked as non-mergable and the next candidate is processed. The process stops when every possible pair of adjacent intervals is marked as non-mergable.

6.3.6 ID3 discretization

ID3 provides a typical example of local discretization. ID3 (Quinlan, 1986) is an inductive learning program that constructs classification rules in the form of a decision tree. It uses local discretization to deal with quantitative attributes. For each quantitative attribute, ID3 divides its sorted values into two intervals in all possible ways. For each division, the resulting information gain of the data is calculated. The attribute that obtains the maximum information gain is chosen to be the current tree node. And the data are divided into subsets corresponding to its two value intervals. In each subset, the same process is recursively conducted to grow the decision tree. The same attribute can be discretized differently if it appears in different branches of the decision tree.
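A sketch of the node-level choice is given below (our own names and data layout: each instance is a dict of attribute values); it returns the attribute and binary cut with the highest information gain for the instances that reach one node, which is where the local, per-branch discretization arises.

from collections import Counter
from math import log2

def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_local_split(instances, classes, attributes):
    # instances: list of dicts mapping attribute name -> numeric value.
    base, n = class_entropy(classes), len(instances)
    best = (0.0, None, None)   # (information gain, attribute, cut point)
    for a in attributes:
        order = sorted(range(n), key=lambda i: instances[i][a])
        vs = [instances[i][a] for i in order]
        cs = [classes[i] for i in order]
        for i in range(1, n):
            if vs[i] == vs[i - 1]:
                continue
            gain = base - (i / n) * class_entropy(cs[:i]) \
                        - ((n - i) / n) * class_entropy(cs[i:])
            if gain > best[0]:
                best = (gain, a, (vs[i - 1] + vs[i]) / 2.0)
    return best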

6.3.7 Non-disjoint discretization

The above mentioned methods are all disjoint discretization. Non-disjoint discretization (NDD) (Yang and Webb, 2002), on the other hand, forms overlapping intervals for a quantitative attribute, always locating a value toward the middle of its discretized interval. This strategy is desirable since it can efficiently form for each single quantitative value a most appropriate interval.

When discretizing a quantitative attribute, suppose there are N instances. NDD identifies among the sorted values t′ atomic intervals, (a′1, b′1], (a′2, b′2], ..., (a′t′, b′t′], each containing s′ instances, so that⁵

s′ = s/3,
s′ × t′ = N.     (6.1)

One interval is formed for each set of three consecutive atomic intervals, such that the kth (1 ≤ k ≤ t′ − 2) interval (ak, bk] satisfies ak = a′k and bk = b′k+2. Each value v is assigned to interval (a′i−1, b′i+1], where i is the index of the atomic interval (a′i, b′i] such that a′i < v ≤ b′i, except when i = 1, in which case v is assigned to interval (a′1, b′3], and when i = t′, in which case v is assigned to interval (a′t′−2, b′t′]. Figure 6.1 illustrates the procedure. As a result, except in the case of falling into the first or the last atomic interval, a numeric value is always toward the middle of its corresponding interval, and intervals can overlap with each other.

Fig. 6.1. Atomic Intervals Compose Actual Intervals
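Given the atomic-interval boundaries, locating the interval for a value can be sketched as follows (illustrative code under our own representation: boundaries b′0 < b′1 < ... < b′t′ so that atomic interval i is (b′i−1, b′i]; at least three atomic intervals are assumed):

def ndd_interval_for(value, atomic_bounds):
    # atomic_bounds[0..t'] are the atomic-interval boundaries; atomic interval i
    # (1-based) is (atomic_bounds[i-1], atomic_bounds[i]].
    t = len(atomic_bounds) - 1
    i = next(k for k in range(1, t + 1)
             if atomic_bounds[k - 1] < value <= atomic_bounds[k])
    first = min(max(i - 1, 1), t - 2)   # first atomic interval of the group of three,
                                        # clamped at the two ends of the value range
    return (atomic_bounds[first - 1], atomic_bounds[first + 2])

bounds = [0, 10, 20, 30, 40, 50]        # five atomic intervals
print(ndd_interval_for(25, bounds))     # (10, 40): the value sits toward the middle
print(ndd_interval_for(5, bounds))      # (0, 30): clamped at the low end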

6.3.8 Lazy discretization

The above mentioned methods are all eager. In comparison, lazy discretization (LD) (Hsu et al., 2000, Hsu et al., 2003) defers discretization until classification time. It waits until a test instance is presented to determine the cut points for each quantitative attribute of this test instance. When classifying an instance, LD creates only one interval for each quantitative attribute containing its value from the instance, and leaves other value regions untouched. In particular, it selects a pair of cut points for each quantitative attribute such that the value is in the middle of its corresponding interval. Where the cut points are located is decided by LD’s primary discretization method, such as EWD.
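With EWD as the primary method, the lazy step for one test value can be sketched as below (illustrative only; the names and the clamping to the observed value range are ours):

def lazy_ewd_interval(test_value, vmin, vmax, k):
    # Form only the one interval this test value needs: an interval of the EWD
    # width w = (vmax - vmin) / k, centred on the value.
    w = (vmax - vmin) / k
    low = max(vmin, test_value - w / 2.0)
    high = min(vmax, test_value + w / 2.0)
    return (low, high)

print(lazy_ewd_interval(7.2, vmin=0.0, vmax=20.0, k=5))  # (5.2, 9.2), width 4.0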

5 Theoretically any odd number k besides 3 is acceptable in (6.1), as long as the same number k of atomic intervals are grouped together later for the probability estimation. For simplicity, we take k = 3 for demonstration.


6.3.9 Dynamic-qualitative discretization

The above mentioned methods are all time-insensitive, while dynamic-qualitative discretization (Mora et al., 2000) is typically time-sensitive. Two approaches are individually proposed to implement dynamic-qualitative discretization. The first approach is to use statistical information about the preceding values observed from the time series to select the qualitative value which corresponds to a new quantitative value of the series. The new quantitative value will be associated to the same qualitative value as its preceding values if they belong to the same population. Otherwise, it will be assigned a new qualitative value. To decide if a new quantitative value belongs to the same population as the previous ones, a statistic with Student’s t distribution is computed.

The second approach is to use distance functions. Two consecutive quantitative values correspond to the same qualitative value when the distance between them is smaller than a predefined significant-distance threshold. The first quantitative value of the time series is used as the reference value. The next values in the series are compared with this reference. When the distance between the reference and a specific value is greater than the threshold, the comparison process stops. For each value between the reference and the last value which has been compared, the following distances are computed: the distance between the value and the first value of the interval, and the distance between the value and the last value of the interval. If the former is lower than the latter, the qualitative value assigned is the one corresponding to the first value. Otherwise, the qualitative value assigned is the one corresponding to the last value.

6.3.10 Ordinal discretization

Ordinal discretization (Frank and Witten, 1999, Macskassy et al., 2001), as its name indicates, conducts a transformation of quantitative data that is able to preserve their ordering information. For a quantitative attribute A∗, ordinal discretization first uses some primary discretization method to form a qualitative attribute with n values (v1, v2, ..., vn). Then it introduces n−1 boolean attributes. The ith boolean attribute represents the test A∗ ≤ vi. These boolean attributes are substituted for the original A∗ and are input to the learning process.
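The transformation itself is mechanical, as the illustrative sketch below shows (names are ours; the ordered qualitative values would come from the primary discretization):

def ordinal_encode(value, ordered_values):
    # Replace one qualitative value from v1 < v2 < ... < vn by the n-1 boolean
    # attributes "A* <= v_i", which preserve the ordering information.
    rank = ordered_values.index(value)
    return [rank <= i for i in range(len(ordered_values) - 1)]

print(ordinal_encode('medium', ['low', 'medium', 'high']))  # [False, True]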

6.3.11 Fuzzy discretization

Fuzzy discretization (FD) (Ishibuchi et al., 2001) is employed for generating linguistic association rules, where many linguistic terms, such as ‘short’ and ‘tall’, cannot be appropriately represented by intervals with sharp cut points. Hence, it employs a membership function, such as in (6.2), so that a height of 150 centimeters indicates ‘tall’ to degree 0, a height of 175 centimeters indicates ‘tall’ to degree 0.5, and a height of 190 centimeters indicates ‘tall’ to degree 1.0. The induction of rules will take those degrees into consideration.


Mem_tall(x) = 0              if x ≤ 170;
              (x − 170)/10   if 170 < x < 180;
              1              if x ≥ 180.     (6.2)

FD uses domain knowledge to define its linguistic membership functions. When dealing with data without such domain knowledge, fuzzy borders can still be set up with commonly used functions such as linear, polynomial and arctan functions, to fuzzify the sharp borders (Wu, 1999). Wu (1999) demonstrated that such fuzzy borders can be useful when rules induced from training examples are applied to a test example that no rule matches.
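The membership function of (6.2) can be written directly as code; the sketch below simply mirrors the definition (the function name is ours):

def mem_tall(x):
    # Degree to which height x (in centimeters) indicates 'tall', per (6.2).
    if x <= 170:
        return 0.0
    if x >= 180:
        return 1.0
    return (x - 170) / 10.0

print(mem_tall(150), mem_tall(175), mem_tall(190))  # 0.0 0.5 1.0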

6.3.12 Iterative-improvement discretization

A typical composite discretization is iterative-improvement discretization (IID) (Pazzani, 1995). It initially forms a set of intervals using EWD or MIEMD, and then iteratively adjusts the intervals to minimize the classification error on the training data. It defines two operators: merge two contiguous intervals, or split an interval into two intervals by introducing a new cut point that is midway between each pair of contiguous values in that interval. In each loop of the iteration, for each quantitative attribute, IID applies both operators in all possible ways to the current set of intervals and estimates the classification error of each adjustment using leave-one-out cross validation. The adjustment with the lowest error is retained. The loop stops when no adjustment further reduces the error. IID can split as well as merge discretized intervals. How many intervals will be formed and where the cut points are located are decided by the error of the cross validation.
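A simplified sketch of the adjustment loop follows. Everything here is illustrative: error_of stands for whatever error estimator is used (the published method uses leave-one-out cross validation of the classifier on the training data), and the split operator adds the midpoint between adjacent cut points, a simplification of the published operator, which splits between contiguous data values inside an interval.

def iterative_improvement(initial_cuts, error_of):
    # Repeatedly try every merge (drop one cut point) and every simplified split
    # (insert a midpoint between adjacent cut points); keep the single adjustment
    # with the lowest estimated error, and stop when none reduces the error.
    cuts = sorted(initial_cuts)
    err = error_of(cuts)
    while True:
        candidates = [cuts[:i] + cuts[i + 1:] for i in range(len(cuts))]
        candidates += [sorted(cuts + [(a + b) / 2.0]) for a, b in zip(cuts, cuts[1:])]
        if not candidates:
            break
        best_err, best_cuts = min(((error_of(c), c) for c in candidates),
                                  key=lambda t: t[0])
        if best_err >= err:
            break
        cuts, err = best_cuts, best_err
    return cuts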

6.3.13 Summary

For each entry of our taxonomy presented in the previous section, we have reviewed a typical discretization method. Table 6.2 summarizes these methods by identifying their categories under each entry of our taxonomy.

6.4 Discretization and the learning context

Although various discretization methods are available, they are tuned to different types of learning, such as decision tree learning, decision rule learning, naive-Bayes learning, Bayes network learning, clustering, and association learning. Different types of learning have different characteristics and hence require different strategies of discretization. It is important to be aware of the learning context whenever designing or employing discretization methods. It is unrealistic to pursue a universally optimal discretization approach that can be blind to its learning context.

For example, decision tree learners can suffer from the fragmentation problem, and hence they may benefit more than other learners from discretization that results in few intervals. Decision rule learners require pure intervals (containing instances dominated by a single class), while probabilistic learners such as naive-Bayes classifiers do not. Association rule learners value the relations between attributes, and thus they desire multivariate discretization that can capture the inter-dependencies among attributes. Lazy learners can further save training effort if coupled with lazy discretization. If a learning algorithm requires values of an attribute to be disjoint, as decision tree learning does, non-disjoint discretization is not applicable.

To explain this issue, we compare the discretization strategies of two popular

learning algorithms, decision tree learning and naive-Bayes learning. Although both

are widely used for inductive learning, decision trees and naive-Bayes classifiers

have very different inductive biases and learning mechanisms. Correspondingly, the discretization approaches best suited to each differ.

6.4.1 Discretization for decision tree learning

Decision tree learning represents the learned concept by a decision tree. Each non-leaf node tests an attribute. Each branch descending from that node corresponds to one of the attribute’s values. Each leaf node assigns a class label. A decision tree classifies instances by sorting them down the tree from the root to some leaf node (Mitchell, 1997). ID3 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993) are well known exemplars of decision tree algorithms.

One popular discretization for decision tree learning is multi-interval-entropy-minimization discretization (MIEMD) (Fayyad and Irani, 1993), as we have reviewed in Section 6.3. MIEMD discretizes a quantitative attribute by calculating the class information entropy as if the classification only uses that single attribute after discretization. This can be suitable for the divide-and-conquer strategy of decision tree learning, but not necessarily appropriate for other learning mechanisms such as naive-Bayes learning (Yang and Webb, 2004).

Furthermore, MIEMD uses the minimum description length criterion (MDL) as the termination condition that decides when to stop further partitioning a quantitative attribute’s value range. This has the effect of forming qualitative attributes with few values (An and Cercone, 1999). This is only desirable for some learning contexts. For decision tree learning, it is important to minimize the number of values of an attribute, so as to avoid the fragmentation problem (Quinlan, 1993). If an attribute has many values, a split on this attribute will result in many branches, each of which receives relatively few training instances, making it difficult to select appropriate subsequent tests. However, minimizing the number of intervals has an adverse impact on naive-Bayes learning, as we will detail in the next section.

6.4.2 Discretization for naive-Bayes learning

When classifying an instance, naive-Bayes classifiers assume the attributes are conditionally independent of each other given the class⁶, and then apply Bayes’ theorem to calculate the probability of each class given this instance. The class with the highest probability is chosen as the class of this instance. Naive-Bayes classifiers are simple, effective⁷, efficient, robust and support incremental training. These merits have seen them deployed in numerous classification tasks.

6 This assumption is often referred to as the attribute independence assumption.

The appropriate discretization methods for naive-Bayes learning include fixed-frequency discretization (Yang, 2003) and non-disjoint discretization (Yang and Webb, 2002), which we have introduced in Section 6.3. Although it has demonstrated strong effectiveness for decision tree learning, MIEMD does not suit naive-Bayes learning. Naive-Bayes learning assumes that attributes are independent of one another given the class, and hence is not subject to the fragmentation problem of decision tree learning. MIEMD tends to minimize the number of discretized intervals, which has a strong potential to reduce the classification variance but increase the classification bias (Yang and Webb, 2004). As the data size becomes large, it is very likely that the loss through bias increase will soon overshadow the gain through variance reduction, resulting in inferior learning performance. However, naive-Bayes learning is particularly popular for learning from large data because of its efficiency. Hence, MIEMD is not a desirable approach for discretization in naive-Bayes learning.

Conversely, if we employ fixed-frequency discretization (FFD) for decision tree learning, the resulting learning performance can be inferior. FFD tends to maximize the number of discretized intervals as long as each interval contains sufficient instances for estimating the naive-Bayes probabilities. Hence FFD has a strong potential to cause a severe fragmentation problem for decision tree learning, especially when the data size is large.

6.5 Summary

Discretization is a process that transforms quantitative data to qualitative data. It builds a bridge between real-world data-mining applications, where quantitative data flourish, and the learning algorithms, many of which are more adept at learning from qualitative data. Hence, discretization has an important role in Data Mining and knowledge discovery. This chapter provides a high level overview of discretization. We have defined and presented terminology for discretization, clarifying the multiplicity of differing definitions among previous literature. We have introduced a comprehensive taxonomy of discretization. Corresponding to each entry of the taxonomy, we have demonstrated a typical discretization method. We have then illustrated the need to consider the requirements of a learning context before selecting a discretization technique. It is essential to be aware of the learning context where a discretization method is to be developed or employed. Different learning algorithms require different discretization strategies. It is unrealistic to pursue a universally optimal discretization approach.

7 Although its assumption is often violated in real-world applications, naive-Bayes learning still achieves surprisingly good classification performance. Domingos and Pazzani (1997) suggested one reason is that the classification estimate under zero-one loss is only a function of the sign of the probability estimation; the classification accuracy can remain high even while the assumption violation causes poor probability estimation.

References

An, A. and Cercone, N. (1999). Discretization of continuous attributes for learning classification rules. In Proceedings of the 3rd Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, pages 509–514.

Bay, S. D. (2000). Multivariate discretization of continuous variables for set mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 315–319.

Bluman, A. G. (1992). Elementary Statistics, A Step By Step Approach. Wm. C. Brown Publishers, pages 5–8.

Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Learning, pages 164–178.

Chmielewski, M. R. and Grzymala-Busse, J. W. (1996). Global discretization of continuous attributes as preprocessing for machine learning. International Journal of Approximate Reasoning, 15:319–331.

Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130.

Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the 12th International Conference on Machine Learning, pages 194–202.

Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1027.

Frank, E. and Witten, I. H. (1999). Making better use of global discretization. In Proceedings of the 16th International Conference on Machine Learning, pages 115–123. Morgan Kaufmann Publishers.

Freitas, A. A. and Lavington, S. H. (1996). Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm. In Advances in Databases, Proceedings of the 14th British National Conference on Databases, pages 124–133.

Hsu, C.-N., Huang, H.-J., and Wong, T.-T. (2000). Why discretization works for naive Bayesian classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 309–406.

Hsu, C.-N., Huang, H.-J., and Wong, T.-T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning, 53(3):235–263.

Ishibuchi, H., Yamamoto, T., and Nakashima, T. (2001). Fuzzy Data Mining: Effect of fuzzy discretization. In The 2001 IEEE International Conference on Data Mining.

Kerber, R. (1992). ChiMerge: Discretization for numeric attributes. In National Conference on Artificial Intelligence, pages 123–128. AAAI Press.

Kohavi, R. and Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 114–119.

Macskassy, S. A., Hirsh, H., Banerjee, A., and Dayanik, A. A. (2001). Using text classifiers for numerical classification. In Proceedings of the 17th International Joint Conference on Artificial Intelligence.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill Companies.

Mora, L., Fortes, I., Morales, R., and Triguero, F. (2000). Dynamic discretization of continuous values from time series. In Proceedings of the 11th European Conference on Machine Learning, pages 280–291.

Pazzani, M. J. (1995). An iterative improvement approach for the discretization of numeric attributes in Bayesian classifiers. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, pages 228–233.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81–106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

Richeldi, M. and Rossotto, M. (1995). Class-driven statistical discretization of continuous attributes (extended abstract). In European Conference on Machine Learning, pages 335–338. Springer.

Rokach, L., Averbuch, M., and Maimon, O. (2004). Information retrieval system for medical narrative reports. Lecture Notes in Artificial Intelligence, 3055, pages 217–228. Springer-Verlag.

Samuels, M. L. and Witmer, J. A. (1999). Statistics For The Life Sciences, Second Edition. Prentice-Hall, pages 10–11.

Wu, X. (1995). Knowledge Acquisition from Databases. Ablex Publishing Corp., Chapter 6.

Wu, X. (1996). A Bayesian discretizer for real-valued attributes. The Computer Journal, 39(8):688–691.

Wu, X. (1999). Fuzzy interpretation of discretized intervals. IEEE Transactions on Fuzzy Systems, 7(6):753–759.

Yang, Y. (2003). Discretization for Naive-Bayes Learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, Melbourne, Australia.

Yang, Y. and Webb, G. I. (2001). Proportional k-interval discretization for naive-Bayes classifiers. In Proceedings of the 12th European Conference on Machine Learning, pages 564–575.

Yang, Y. and Webb, G. I. (2002). Non-disjoint discretization for naive-Bayes classifiers. In Proceedings of the 19th International Conference on Machine Learning, pages 666–673.

Yang, Y. and Webb, G. I. (2004). Discretization for naive-Bayes learning: Managing discretization bias and variance. Submitted for publication.


Table 6.2. Taxonomy of Discretization Methods

Each method is listed with its category under taxonomy entries 0.–10. (corresponding to Section 6.2):

Equal-width, Equal-frequency, Fixed-frequency: primary; unsupervised; parametric; non-hierarchical; univariate; disjoint; global; eager; time-insensitive; nominal; non-fuzzy.
Multi-interval-entropy-minimization: primary; supervised; non-parametric; hierarchical; univariate; disjoint; global; eager; time-insensitive; nominal; non-fuzzy.
ChiMerge, StatDisc, InfoMerge: primary; supervised; non-parametric; hierarchical; univariate; disjoint; global; eager; time-insensitive; nominal; non-fuzzy.
Cluster-based: primary; unsupervised; non-parametric; hierarchical; multivariate; disjoint; global; eager; time-insensitive; nominal; non-fuzzy.
ID3: primary; supervised; parametric; hierarchical; univariate; disjoint; local; eager; time-insensitive; nominal; non-fuzzy.
Non-disjoint: composite; unsupervised; *; non-hierarchical; univariate; non-disjoint; global; eager; time-insensitive; nominal; non-fuzzy.
Lazy: composite; *; *; *; univariate; non-disjoint; global; lazy; time-insensitive; nominal; non-fuzzy.
Dynamic-qualitative: primary; unsupervised; non-parametric; non-hierarchical; univariate; disjoint; local; lazy; time-sensitive; nominal; non-fuzzy.
Ordinal: composite; *; *; *; univariate; disjoint; global; eager; time-insensitive; ordinal; non-fuzzy.
Fuzzy: composite; *; *; *; univariate; non-disjoint; global; eager; time-insensitive; nominal; fuzzy.
Iterative-improvement: composite; supervised; *; hierarchical; multivariate; disjoint; global; eager; time-insensitive; nominal; non-fuzzy.

Note: the taxonomy entries are 0. primary vs. composite; 1. supervised vs. unsupervised; 2. parametric vs. non-parametric; 3. hierarchical vs. non-hierarchical; 4. univariate vs. multivariate; 5. disjoint vs. non-disjoint; 6. global vs. local; 7. eager vs. lazy; 8. time-sensitive vs. time-insensitive; 9. ordinal vs. nominal; 10. fuzzy vs. non-fuzzy. An entry filled with ‘*’ indicates that the corresponding method can be conducted in either way of the corresponding taxonomy entry. This often happens for composite methods, whose taxonomy depends on their primary methods.