Data Mining- Discretization

DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

Discretization: Types of attributes:

Nominal: values from an unordered set, e.g., color, profession

Ordinal: values from an ordered set, e.g., military or academic rank

Continuous: numeric values, e.g., integer or real numbers

Discretization: Divide the range of a continuous attribute into intervals

Reduce data size by discretization

Discretization and Concept Hierarchy: Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data values

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute

Concept hierarchy: Concept hierarchy formation

Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Detail is lost

More meaningful

Easier to interpret

Mining becomes easier

Several concept hierarchies can be defined for the same attribute

Manual / Implicit
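As a minimal sketch of replacing low-level values with higher-level concepts, the age example above could be written as follows. The cut points 40 and 60 and the names `AGE_CUTS`, `AGE_LABELS`, and `age_concept` are assumptions made for illustration; the slides do not state where the boundaries lie.

```python
from bisect import bisect_right

# Hypothetical boundaries: ages below 40 are "young", 40-59 "middle-aged", 60+ "senior".
AGE_CUTS = [40, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]

def age_concept(age):
    """Map a numeric age to its higher-level concept label."""
    # bisect_right counts how many cut points are <= age, which is the label index.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

ages = [23, 40, 59, 60, 81]
labels = [age_concept(a) for a in ages]
```

A second hierarchy for the same attribute would simply be another pair of cut points and labels.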

Discretization and Concept Hierarchy Generation for Numeric Data: Typical methods:

Binning

Histogram analysis

Clustering analysis

Entropy-based discretization

χ² merging

Segmentation by natural partitioning

All the methods can be applied recursively

Techniques: Binning

Distribute values into bins

Replace each value by the bin mean / median

Recursive application leads to concept hierarchies

Unsupervised technique
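The binning steps above can be sketched as follows, assuming equal-depth bins and smoothing by bin means. The function names and the sample `prices` list are made up for illustration.

```python
from statistics import mean

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data only
bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
```

Swapping `mean` for `statistics.median` gives smoothing by bin medians. Applying the same functions again to the bin means would produce the next level of a concept hierarchy.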

Histogram Analysis

Partition the data distribution into intervals

Equiwidth: (0, 100], (100, 200], ...

Equidepth

Recursive, until a minimum interval size is reached

Unsupervised

Cluster Analysis

Clusters form nodes of concept hierarchy

Can decompose / combine clusters to form the lower level / higher level of the hierarchy
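The equiwidth and equidepth partitions listed under Histogram Analysis can be compared with a small sketch. It assumes left-open intervals of the form (a, b], with the lowest value counted in the first interval, and at least as many values as intervals. All function names and the sample data are hypothetical.

```python
from bisect import bisect_left

def equiwidth_edges(lo, hi, n_intervals):
    """Boundaries of n_intervals intervals that all have the same width."""
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(n_intervals + 1)]

def equidepth_edges(values, n_intervals):
    """Boundaries chosen so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Each cut is the last value that belongs to the interval on its left.
    cuts = [ordered[(i * n) // n_intervals - 1] for i in range(1, n_intervals)]
    return [ordered[0]] + cuts + [ordered[-1]]

def counts_per_interval(values, edges):
    """Count values per (a, b] interval; values are assumed to lie within the edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = max(bisect_left(edges, v) - 1, 0)
        counts[i] += 1
    return counts

data = [1, 2, 3, 4, 50, 60, 70, 200]  # illustrative, skewed data
width_counts = counts_per_interval(data, equiwidth_edges(0, 200, 2))
depth_counts = counts_per_interval(data, equidepth_edges(data, 2))
```

On skewed data such as this, the equiwidth partition leaves most values in one interval, while the equidepth partition balances the counts at the cost of unequal widths.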

Entropy-Based Discretization: Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is

\[ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) \]

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

\[ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i \]

where p_i is the probability of class i in S1

The boundary that minimizes the expected information requirement over all possible boundaries is selected as a binary discretization

The process is recursively applied to the partitions obtained until some stopping criterion is met

Reduces data size

Class information is considered

Can improve classification accuracy
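A minimal sketch of one binary split, assuming candidate boundaries are the midpoints between consecutive distinct values and that the expected information requirement is the size-weighted sum of the entropies of S1 and S2. The function names and the sample data are hypothetical, and the recursion and stopping criterion are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: sum of p * log2(1/p), i.e. -sum of p * log2(p)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def best_boundary(samples):
    """samples: list of (value, class_label) pairs.

    Returns (T, expected_info) for the boundary T with the lowest expected
    information requirement, or None if there are fewer than two distinct values.
    """
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # assumed candidate: midpoint between neighbours
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(samples)
        if best is None or info < best[1]:
            best = (t, info)
    return best

samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundary, info = best_boundary(samples)
```

To build a full discretization, the same function would be called again on the samples on each side of the chosen boundary until the stopping criterion is met.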

Interval Merging by χ² Analysis: ChiMerge

Bottom-up approach: find the best neighbouring intervals and merge them to form larger intervals

Supervised

If two adjacent intervals have a similar distribution of classes, they can be merged

Initially each distinct value is in a separate interval

χ² tests are performed for every pair of adjacent intervals

The adjacent intervals with the lowest χ² values are merged

Can be repeated

Stopping condition (χ² threshold, number of intervals)
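The merging loop above can be sketched as follows. This is a simplified illustration, not a reference implementation: it stops only on a maximum number of intervals (the threshold test is omitted), merges one pair per pass, and skips cells whose expected count is zero. The names `chi2` and `chimerge` and the sample data are hypothetical.

```python
from collections import Counter

def chi2(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    rows = [counts_a, counts_b]
    total = sum(sum(r.values()) for r in rows)
    stat = 0.0
    for cls in classes:
        col = sum(r[cls] for r in rows)
        for r in rows:
            expected = sum(r.values()) * col / total
            if expected > 0:  # assumption: cells with zero expected count are skipped
                stat += (r[cls] - expected) ** 2 / expected
    return stat

def chimerge(samples, max_intervals):
    """samples: list of (value, class_label). Returns a list of (low, high) intervals."""
    classes = sorted({c for _, c in samples})
    by_value = {}
    for v, c in samples:
        by_value.setdefault(v, Counter())[c] += 1
    # Initially each distinct value is its own interval.
    intervals = [([v, v], by_value[v]) for v in sorted(by_value)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        (lo, _), ca = intervals[i]
        (_, hi), cb = intervals[i + 1]
        intervals[i:i + 2] = [([lo, hi], ca + cb)]
    return [tuple(bounds) for bounds, _ in intervals]

samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
merged = chimerge(samples, 2)
```

A threshold-based stop would replace the `while` condition with a check that the smallest χ² value is still below the chosen threshold.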

Segmentation by Natural Partitioning: A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Outliers could be present

Consider only the majority values, e.g., from the 5th percentile to the 95th percentile

[Figure: Example of 3-4-5 Rule]
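One level of the 3-4-5 rule can be sketched as follows, following the rule as stated above (equal-width intervals in every case). The sketch assumes that `low` and `high` have already been rounded to the most significant digit, and that `msd_unit` is the value of one unit at that digit. All names and the sample range are hypothetical.

```python
def intervals_345(distinct_msd_values):
    """Number of intervals prescribed by the 3-4-5 rule."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("3-4-5 rule covers 1 to 10 distinct values only")

def partition_345(low, high, msd_unit):
    """Split (low, high] into equal-width intervals chosen by the 3-4-5 rule.

    Assumes low and high are already rounded to multiples of msd_unit.
    """
    distinct = round((high - low) / msd_unit)
    k = intervals_345(distinct)
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

# A range from -1000 to 2000 covers 3 distinct values at the thousands digit.
top_level = partition_345(-1000, 2000, 1000)
```

Calling `partition_345` again on each resulting interval, with a smaller `msd_unit`, would give the next level of the hierarchy.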

Concept Hierarchy Generation for Categorical Data:

Specification of a partial ordering of attributes explicitly at the schema level by users or experts: the user / expert defines the hierarchy, e.g., street < city < state < country

Specification of a portion of a hierarchy by explicit data grouping: manual; intermediate-level information is specified, e.g., Industrial, Agricultural, ...

Specification of a set of attributes but not their partial ordering: the hierarchy is inferred automatically with a heuristic rule: high-level concepts contain a smaller number of distinct values

Specification of only a partial set of attributes: data semantics are embedded so that attributes with tight semantic connections are pinned together
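The heuristic for inferring a hierarchy automatically (higher-level concepts have fewer distinct values) can be sketched as follows. The function name `infer_hierarchy` and the sample rows are made up for illustration.

```python
def infer_hierarchy(rows, attributes):
    """Order attributes from lowest to highest level by their number of distinct values."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    # Most distinct values first: that attribute sits at the bottom of the hierarchy.
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)

# Illustrative records only.
rows = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "3 Elm Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "4 Pine St", "city": "Madison", "state": "WI", "country": "USA"},
]
levels = infer_hierarchy(rows, ["country", "street", "state", "city"])
```

Because this is only a heuristic, the inferred order may need manual correction when an attribute's distinct-value count does not reflect its level.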

