Data Mining: Discretization
DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

Discretization: Types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integers or real numbers

Discretization:
Divide the range of a continuous attribute into intervals
Reduce data size by discretization

Discretization and Concept Hierarchy: Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute

Concept hierarchy: Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
Detail is lost
The generalized data is more meaningful and easier to interpret
Mining becomes easier
Several concept hierarchies can be defined for the same attribute
Manual / implicit specification
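
To make the replacement of low-level values by higher-level concepts concrete, here is a minimal Python sketch. The cut points (30 and 60) and the names `CUTS`, `LABELS` and `age_concept` are assumptions chosen for illustration; the slides only name the three concepts and do not say where the boundaries lie.

```python
from bisect import bisect_right

# Hypothetical cut points (not from the slides):
# ages below 30 -> "young", 30 to 59 -> "middle-aged", 60 and above -> "senior".
CUTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]

def age_concept(age):
    """Replace a numeric age by its higher-level concept label."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(CUTS, age)]

ages = [23, 30, 45, 59, 60, 71]
print([age_concept(a) for a in ages])
```

Six distinct numeric values collapse to three labels, which is the loss of detail and the reduction in data size described above.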

Discretization and Concept Hierarchy Generation for Numeric Data: Typical methods:
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
χ² merging
Segmentation by natural partitioning
All the methods can be applied recursively

Techniques: Binning
Distribute values into bins
Replace each value by its bin mean or median
Recursive application leads to concept hierarchies
Unsupervised technique
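
The binning steps above can be sketched as follows. This is an illustrative sketch only: the function name `smooth_by_bin_means`, the choice of equal-depth bins, and the sample values are assumptions, not something the slides prescribe.

```python
from statistics import mean

def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into n_bins equal-depth bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    depth, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for b in range(n_bins):
        # The first `extra` bins take one additional value
        # when the data does not divide evenly.
        size = depth + (1 if b < extra else 0)
        bin_values = ordered[start:start + size]
        if bin_values:
            smoothed.extend([mean(bin_values)] * len(bin_values))
        start += size
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative sample data
print(smooth_by_bin_means(prices, 3))
```

With three bins of three values each, the bins collapse to the means 9, 22 and 29. Swapping `mean` for `statistics.median` gives smoothing by bin medians instead.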
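
Histogram Analysis:
Partitions the data distribution into intervals
Equiwidth: intervals of equal width, e.g., (0-100], (100-200]
Equidepth: intervals holding roughly the same number of values
Can be applied recursively until a minimum interval size is reached
Unsupervised technique

Cluster Analysis:
Clusters form the nodes of a concept hierarchy
Clusters can be decomposed into sub-clusters (lower level of the hierarchy) or combined into larger clusters (higher level of the hierarchy)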
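
As a small illustration of the equiwidth partition mentioned under histogram analysis, the sketch below maps a value to the label of its half-open interval, following the (0-100], (100-200] pattern. The function name `equiwidth_label` and the default width of 100 are assumptions for illustration.

```python
import math

def equiwidth_label(value, width=100):
    """Return the label of the half-open interval (lo-hi] that contains value."""
    # Rounding up to the next multiple of width gives the closed upper edge.
    hi = math.ceil(value / width) * width
    lo = hi - width
    return f"({lo}-{hi}]"

for v in (37, 100, 101, 200):
    print(v, equiwidth_label(v))
```

Because the intervals are closed on the right, 100 falls into (0-100] and 101 into (100-200].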

Entropy-Based Discretization:
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is
$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
$$\mathrm{Ent}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the probability of class i in S1
The boundary that minimizes the expected information requirement over all possible boundaries is selected as a binary discretization
The process is recursively applied to the partitions obtained until some stopping criterion is met
Reduces data size
Class information is considered (supervised)
May improve classification accuracy
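
A minimal sketch of one binary split chosen by this criterion is shown below. The function names, the use of midpoints between consecutive distinct values as candidate boundaries, and the sample data are assumptions for illustration; the recursive application and the stopping criterion are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: the sum of -p_i * log2(p_i)."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, info) where T minimizes the expected information
    requirement after splitting into S1 (values <= T) and S2 (values > T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate boundary (assumption: midpoint)
        s1 = [lab for _, lab in pairs[:i]]
        s2 = [lab for _, lab in pairs[i:]]
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if best is None or info < best[1]:
            best = (t, info)
    return best

# Illustrative sample data (not from the slides)
ages = [23, 25, 31, 35, 42, 47, 51, 60]
classes = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
print(best_boundary(ages, classes))
```

In this sample the classes separate cleanly between 31 and 35, so the chosen boundary is 33.0 and the expected information requirement after the split is zero.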

Interval Merging by χ² Analysis: ChiMerge
Bottom-up approach: find the best neighbouring intervals and merge them to form larger intervals
Supervised
If two adjacent intervals have a similar distribution of classes, they can be merged
Initially each distinct value is in a separate interval
χ² tests are performed for every pair of adjacent intervals
The adjacent pair with the lowest χ² value is merged
Merging can be repeated
Stopping condition: a χ² threshold or a target number of intervals
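
The merging loop described above can be sketched as follows. This is a simplified sketch under stated assumptions: the function names are hypothetical, the stopping condition is only a target number of intervals (no χ² threshold), and cells with an expected count of zero are skipped.

```python
from collections import Counter

def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Merge adjacent intervals bottom-up until max_intervals remain.
    Returns the lower bound of each remaining interval."""
    classes = sorted(set(labels))
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    # Initially each distinct value is its own interval: [lower bound, class counts]
    intervals = [[v, counts[v]] for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lo for lo, _ in intervals]

# Illustrative sample data (not from the slides)
print(chimerge([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"], 2))
```

For this sample the result is `[1, 4]`: values 1 to 3 form one interval and values 4 to 6 the other, because neighbouring intervals with identical class distributions score a χ² of zero and are merged first.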

Segmentation by Natural Partitioning:
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6 and 9; grouped 2-3-2 for 7)
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals
Outliers could be present
Consider only the majority of values, e.g., those between the 5th percentile and the 95th percentile
(Figure: example of the 3-4-5 rule)
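
One level of the 3-4-5 rule can be sketched as below. The function name `three_four_five`, the restriction to integer bounds, and the sample bounds are assumptions for illustration; the inputs are taken to be the 5th and 95th percentile values of some hypothetical attribute, and recursion into the resulting intervals is left out.

```python
def three_four_five(low, high):
    """Apply one level of the 3-4-5 rule to integer bounds low < high.
    Returns the list of interval edges."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    # Unit of the most significant digit of the larger bound, e.g. 1838 -> 1000
    msd = 10 ** (len(str(max(abs(low), abs(high)))) - 1)
    lo = (low // msd) * msd         # round low down to the msd unit
    hi = -((-high) // msd) * msd    # round high up to the msd unit
    distinct = (hi - lo) // msd     # distinct values at the most significant digit
    if distinct == 7:
        # 7 distinct values: three intervals grouped 2-3-2
        return [lo, lo + 2 * msd, lo + 5 * msd, hi]
    for parts, cases in ((3, (3, 6, 9)), (4, (2, 4, 8)), (5, (1, 5, 10))):
        if distinct in cases:
            return [lo + (hi - lo) * k / parts for k in range(parts + 1)]
    raise ValueError(f"{distinct} distinct values are not covered by the rule")

# Illustrative bounds (assumed 5th and 95th percentile values)
print(three_four_five(-159, 1838))
```

Here the bounds round to -1000 and 2000 at the thousands digit, which covers 3 distinct values, so the range is cut into the three intervals (-1000, 0], (0, 1000] and (1000, 2000].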

Concept Hierarchy Generation for Categorical Data:
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  User / expert defines the hierarchy
  e.g., street < city < state < country
Specification of a portion of a hierarchy by explicit data grouping
  Manual
  Intermediate-level information is specified, e.g., Industrial, Agricultural..
Specification of a set of attributes but not their partial ordering
  The hierarchy is inferred automatically
  Heuristic rule: high-level concepts contain a smaller number of distinct values
Specification of only a partial set of attributes
  Embedding data semantics
  Attributes with tight semantic connections are pinned together
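
The heuristic for inferring an ordering when only the set of attributes is given can be sketched as follows. The function name `infer_hierarchy` and the sample records are hypothetical; the sketch simply sorts attributes by their number of distinct values, placing the attribute with the most distinct values at the lowest level.

```python
def infer_hierarchy(records, attributes):
    """Order attributes from the lowest to the highest hierarchy level,
    assuming higher-level concepts have fewer distinct values."""
    distinct = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)

# Hypothetical sample records: 5 streets, 3 cities, 2 states, 1 country
records = [
    {"street": "Street A", "city": "City 1", "state": "State X", "country": "Country Z"},
    {"street": "Street B", "city": "City 1", "state": "State X", "country": "Country Z"},
    {"street": "Street C", "city": "City 2", "state": "State X", "country": "Country Z"},
    {"street": "Street D", "city": "City 3", "state": "State Y", "country": "Country Z"},
    {"street": "Street E", "city": "City 3", "state": "State Y", "country": "Country Z"},
]

order = infer_hierarchy(records, ["country", "street", "state", "city"])
print(" < ".join(order))
```

For these records the inferred order is street < city < state < country. The rule is only a heuristic: an attribute such as weekday has just 7 distinct values yet does not sit above month or year, so an inferred hierarchy may still need a manual check.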