Data Mining- Discretization


  • DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

    Types of attributes:

    Nominal: values from an unordered set, e.g., color, profession

    Ordinal: values from an ordered set, e.g., military or academic rank

    Continuous: numeric values, e.g., integer or real numbers

    Discretization: divide the range of a continuous attribute into intervals; this reduces data size

  • Discretization and Concept Hierarchy: Discretization

    Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

    Interval labels can then be used to replace actual data values

    Supervised vs. unsupervised

    Split (top-down) vs. merge (bottom-up)

    Discretization can be performed recursively on an attribute

  • Discretization and Concept Hierarchy: Concept hierarchy formation

    Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

    Detail is lost, but the generalized data is more meaningful and easier to interpret

    Mining becomes easier

    Several concept hierarchies can be defined for the same attribute

    Hierarchies can be specified manually or implicitly
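
    As a minimal sketch of replacing a low-level concept by a higher-level one, the snippet below maps numeric ages to the labels young, middle-aged, and senior. The cut points 30 and 60 and the names AGE_CUTS, AGE_LABELS, and age_concept are assumptions made for illustration; the slides do not give them.

```python
from bisect import bisect_right

# Hypothetical cut points (not from the slides):
# age < 30 -> "young", 30 <= age < 60 -> "middle-aged", age >= 60 -> "senior".
AGE_CUTS = [30, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]


def age_concept(age):
    """Map a numeric age to its higher-level concept label."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [13, 29, 30, 45, 60, 72]
labels = [age_concept(a) for a in ages]
```

    A second hierarchy for the same attribute would only need a different pair of cut-point and label lists.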

  • Discretization and Concept Hierarchy Generation for Numeric Data

    Typical methods:

    Binning

    Histogram analysis

    Clustering analysis

    Entropy-based discretization

    χ² merging

    Segmentation by natural partitioning

    All the methods can be applied recursively

  • Techniques: Binning

    Distribute values into bins

    Replace each value by its bin mean or median

    Recursive application leads to concept hierarchies

    Unsupervised technique
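
    The binning steps above can be sketched as follows: sort the values, split them into equal-depth bins, and replace each value by its bin mean. The function names and the sample data are illustrative assumptions, and the sketch assumes there are at least as many values as bins.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 1 <= n_bins <= len(values):
        raise ValueError("need 1 <= n_bins <= number of values")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_means(bins)
```

    Smoothing by medians would only change the statistic computed inside smooth_by_means.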

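  • Histogram Analysis

    Partition the data distribution into buckets

    Equi-width: e.g., (0-100], (100-200], ...

    Equi-depth: each bucket holds roughly the same number of values

    Can be applied recursively, down to a minimum interval size

    Unsupervised

  • Cluster Analysis

    Clusters form the nodes of a concept hierarchy

    Clusters can be decomposed into subclusters (lower level of the hierarchy) or combined into larger clusters (higher level of the hierarchy)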
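
    To illustrate how clusters can serve as discretization intervals, the sketch below groups one-dimensional values by cutting the sorted data at its widest gaps. This gap-based grouping is a stand-in chosen for brevity, not a method named in the slides; any clustering algorithm could take its place. The function name and data are hypothetical.

```python
def gap_clusters(values, k):
    """Group 1-D values into k clusters by cutting at the k - 1 widest gaps."""
    ordered = sorted(values)
    if not 1 <= k <= len(ordered):
        raise ValueError("need 1 <= k <= number of values")
    # Index i stands for the gap between ordered[i] and ordered[i + 1].
    gaps = sorted(range(len(ordered) - 1),
                  key=lambda i: ordered[i + 1] - ordered[i],
                  reverse=True)
    cuts = sorted(gaps[:k - 1])
    clusters, start = [], 0
    for c in cuts:
        clusters.append(ordered[start:c + 1])
        start = c + 1
    clusters.append(ordered[start:])
    return clusters


readings = [1, 2, 3, 10, 11, 12, 30, 31]  # illustrative data
lower_level = gap_clusters(readings, 3)   # finer clusters
higher_level = gap_clusters(readings, 2)  # coarser clusters, one level up
intervals = [(c[0], c[-1]) for c in lower_level]
```

    Calling the function with a smaller k combines clusters, which corresponds to moving to a higher level of the hierarchy.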

  • Entropy-Based Discretization

    Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is

    $I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$

    Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    $\mathrm{Ent}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$

    where p_i is the probability of class i in S1

    The boundary that minimizes the expected information requirement over all possible boundaries is selected as a binary discretization

    The process is recursively applied to the partitions obtained until some stopping criterion is met

    Reduces data size

    Class information is considered (supervised)

    Can improve classification accuracy
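
    A minimal sketch of one binary split, assuming candidate boundaries are the midpoints between consecutive distinct values (a common choice that the slides do not specify). The names entropy and best_boundary and the sample data are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Ent(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T with the lowest expected information.

    I(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2).
    Returns None when all values are equal (no boundary exists).
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best


ages = [23, 25, 31, 35, 47, 52]              # illustrative data
buys = ["no", "no", "no", "yes", "yes", "yes"]
boundary, info = best_boundary(ages, buys)
```

    Applying best_boundary again to the samples on each side of the boundary gives the recursive procedure; a stopping criterion such as a minimum information gain would end the recursion.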

  • Interval Merging by χ² Analysis: ChiMerge

    Bottom-up approach: find the best neighbouring intervals and merge them to form larger intervals

    Supervised

    If two adjacent intervals have a similar distribution of classes, they can be merged

    Initially, each distinct value is in a separate interval

    χ² tests are performed for every pair of adjacent intervals

    The adjacent pair with the lowest χ² value is merged

    Merging is repeated until a stopping condition is met (χ² threshold, number of intervals)
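
    The merge loop above can be sketched as follows. Each interval is represented by its lower bound and a list of per-class counts; that representation, the function names, and the handling of empty cells (cells with an expected count of zero are skipped) are simplifying assumptions rather than details given in the slides.

```python
def chi2(a, b):
    """Chi-square statistic for two adjacent intervals.

    a and b are lists of per-class counts in the same class order.
    """
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j, observed in enumerate(row):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # simplification: skip cells with no expected count
                stat += (observed - expected) ** 2 / expected
    return stat


def chimerge(intervals, max_intervals, threshold=None):
    """Merge adjacent intervals bottom-up.

    intervals: list of (lower_bound, class_counts), sorted by lower_bound.
    Stops when max_intervals remain or, if threshold is given, when the
    lowest chi-square value among adjacent pairs exceeds it.
    """
    intervals = [(lo, list(counts)) for lo, counts in intervals]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        lowest = min(scores)
        if threshold is not None and lowest > threshold:
            break
        i = scores.index(lowest)
        lo, left = intervals[i]
        _, right = intervals[i + 1]
        intervals[i:i + 2] = [(lo, [x + y for x, y in zip(left, right)])]
    return intervals


# Illustrative data: counts are [class A, class B] for each distinct value.
start = [(1, [2, 0]), (3, [3, 0]), (7, [0, 2]), (9, [0, 3])]
merged = chimerge(start, max_intervals=2)
```

    A low χ² value means the two class distributions are similar, so the pair with the lowest score is the one merged at each step.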

  • Segmentation by Natural Partitioning

    A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals

    If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; a 2-3-2 grouping for 7)

    If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals

    If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals

    Outliers could be present, so consider only the majority of values, e.g., the range from the 5th percentile to the 95th percentile

    Example of the 3-4-5 rule [figure not preserved]
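
    A sketch of one level of the 3-4-5 rule, under stated assumptions: the inputs low and high stand for the 5th and 95th percentiles, the most significant digit is taken from the width of the range, and 7 distinct values are split with the common 2-3-2 grouping. The function name and the sample figures are hypothetical.

```python
import math


def natural_partition(low, high):
    """Split the range low..high into 3, 4, or 5 intervals by the 3-4-5 rule.

    Returns a list of (lower, upper) pairs, each read as (lower, upper].
    """
    span = high - low
    if span <= 0:
        raise ValueError("high must be greater than low")
    # Unit of the most significant digit of the range width.
    unit = 10 ** math.floor(math.log10(span))
    lo = math.floor(low / unit) * unit   # round low down
    hi = math.ceil(high / unit) * unit   # round high up
    distinct = round((hi - lo) / unit)   # distinct values at that digit
    if distinct == 7:
        widths = [2 * unit, 3 * unit, 2 * unit]
    elif distinct in (3, 6, 9):
        widths = [(hi - lo) / 3] * 3
    elif distinct in (2, 4, 8):
        widths = [(hi - lo) / 4] * 4
    elif distinct in (1, 5, 10):
        widths = [(hi - lo) / 5] * 5
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    edges = [lo]
    for w in widths:
        edges.append(edges[-1] + w)
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical 5th and 95th percentiles of a profit attribute.
top_level = natural_partition(-160_000, 1_840_000)
seven_case = natural_partition(10, 75)
```

    Each returned interval can be passed back to natural_partition to build the next level of the hierarchy.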

  • Concept Hierarchy Generation for Categorical Data

    Specification of a partial ordering of attributes explicitly at the schema level by users or experts

    A user or expert defines the hierarchy, e.g., street < city < state < country

    Specification of a portion of a hierarchy by explicit data grouping

    Manual: intermediate-level information is specified by hand, e.g., Industrial, Agricultural, ...

    Specification of a set of attributes, but not of their partial ordering

    The hierarchy is inferred automatically using a heuristic rule: higher-level concepts contain a smaller number of distinct values

    Specification of only a partial set of attributes

    Data semantics are embedded: attributes with tight semantic connections are pinned together
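
    The distinct-value heuristic can be sketched as follows: count the distinct values of each attribute and place the attribute with the most values at the lowest level. The function name and the sample rows are made up for illustration.

```python
def infer_hierarchy(rows, attributes):
    """Order attributes from lowest to highest level by distinct-value count."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    # More distinct values -> lower level in the hierarchy.
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)


# Hypothetical location records.
rows = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "9 Oak Ave", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "4 Lake Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "7 Pine St", "city": "Madison", "state": "WI", "country": "USA"},
    {"street": "2 Elm St", "city": "Madison", "state": "WI", "country": "USA"},
]
levels = infer_hierarchy(rows, ["country", "city", "street", "state"])
```

    The heuristic is not foolproof: an attribute such as weekday has only 7 distinct values yet does not sit above month or year, so an inferred ordering may still need review by a user or expert.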
