02 Data Preprocessing


Transcript of 02 Data Preprocessing


    This chapter covers:

    1. Why preprocess the data?

    2. Descriptive data summarization

    3. Data cleaning

    4. Data integration and transformation

    5. Data reduction

    6. Discretization and concept hierarchy generation


    Data in the real world is dirty:

    incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

    e.g., occupation = "" (a missing value)

    noisy: containing errors or outliers

    e.g., Salary = -10

    inconsistent: containing discrepancies in codes or names

    e.g., Age = 42 but Birthday = 03/07/1997

    e.g., ratings that were recorded as 1, 2, 3 are now recorded as A, B, C

    e.g., discrepancies between duplicate records

    Real-world data are highly susceptible to noise, incompleteness, and inconsistency.


    Why Is Data Dirty?


    Inconsistent data may come from:

    different data sources

    functional dependency violations (e.g., modifying some linked data)

    Duplicate records also need data cleaning.


    Why Is Data Preprocessing Important?

    No quality data, no quality mining results!

    Quality decisions must be based on quality data.

    e.g., duplicate or missing data may cause incorrect or even misleading statistics.

    A data warehouse needs consistent integration of quality data.

    Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.

    (Figure: noisy data leads to a lack of data quality, which in turn leads to a lack of quality in query results, information, and mining, and finally to a lack of quality in decision making.)


    Measures of Data Quality

    Accuracy

    Completeness

    Consistency

    Timeliness

    Believability

    Value added

    Interpretability

    Accessibility


    Major Tasks in Data Preprocessing

    Technique and purpose:

    1. Data cleaning: can be applied to remove noise and correct inconsistencies in the data.

    2. Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse.

    3. Data transformation: operations such as normalization.

    4. Data reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.

    5. Data discretization: part of data reduction, but of particular importance, especially for numerical data.


    Data preprocessing techniques are applied before mining.

    They can improve the overall quality of the patterns mined and reduce the time required for the actual mining.


    For data preprocessing to be successful, it is essential to have an overall picture of the data.

    For many preprocessing tasks, users would like to learn about data characteristics regarding both the central tendency and the dispersion of the data.


    Measuring the central tendency

    A distributive measure is a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the entire data set. Examples: sum(), count().

    An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Examples: mean, median, mode, and midrange.


    Measuring the dispersion

    The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. Examples: range, quartile deviation (QD), standard deviation (SD), quantile plots, scatter plots.


    Real-world data tend to be incomplete, noisy, and inconsistent.

    Data cleaning routines attempt

    to fill in missing values

    to smooth out noise while identifying outliers

    to correct inconsistencies in the data

    to resolve redundancy caused by data integration.


    1. Missing Data

    Data is not always available. E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.

    Missing data may be due to:

    equipment malfunction

    inconsistency with other recorded data, and hence deletion

    data not entered due to misunderstanding

    certain data not being considered important at the time of entry

    the history or changes of the data not being registered

    Missing data may need to be inferred.


    How to Handle Missing Data?

    The methods are as follows:

    Ignore the tuple

    This is usually done when the class label is missing. The method is not very effective unless the tuple contains several attributes with missing values.

    Fill in the missing values manually

    This is time consuming and may not be feasible for large data sets with many missing values.

    Use a global constant to fill in the missing value

    Use the attribute mean to fill in the missing value

    Use the most probable value to fill in the missing value
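    As an illustration of the strategies above, here is a minimal sketch using pandas; the DataFrame, the income column, and the constant -1 are made-up assumptions for the example, not part of the original slides.

```python
import pandas as pd

# Toy sales data with a missing customer income value (made-up example data).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [52000, None, 61000, 48000],
})

# Strategy 1: ignore (drop) tuples with missing values.
dropped = df.dropna(subset=["income"])

# Strategy 2: fill with a global constant.
constant_filled = df.fillna({"income": -1})

# Strategy 3: fill with the attribute mean.
mean_filled = df.fillna({"income": df["income"].mean()})

print(mean_filled)
```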


    2. Noisy Data

    What is noise?

    It is random error or variance in a measured variable.


    Why does noise occur?

    Incorrect attribute values may be due to:

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitations

    inconsistency in naming conventions


    How to Handle Noisy Data?

    Technique and how it is applied:

    1. Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.

    2. Regression: smooth by fitting the data to regression functions.

    3. Clustering: detect and remove outliers.

    4. Combined computer and human inspection: detect suspicious values and have a human check them.


    Simple Discretization Methods: Binning

    Equal-width (distance) partitioning

    Divides the range into N intervals of equal size: a uniform grid.

    If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.

    The most straightforward approach, but outliers may dominate the presentation.

    Skewed data is not handled well.

    Equal-depth (frequency) partitioning

    Divides the range into N intervals, each containing approximately the same number of samples.

    Good data scaling.

    Managing categorical attributes can be tricky.


    Binning Methods -- Examples

    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

    * Partition into equal-frequency (equi-depth) bins:

    - Bin 1: 4, 8, 9, 15

    - Bin 2: 21, 21, 24, 25

    - Bin 3: 26, 28, 29, 34

    * Smoothing by bin means:

    - Bin 1: 9, 9, 9, 9

    - Bin 2: 23, 23, 23, 23

    - Bin 3: 29, 29, 29, 29

    * Smoothing by bin boundaries:

    - Bin 1: 4, 4, 4, 15

    - Bin 2: 21, 21, 25, 25

    - Bin 3: 26, 26, 26, 34
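    A minimal sketch that reproduces the equal-frequency binning and the two smoothing variants above in plain Python; the helper names and the choice of 3 bins are assumptions for the example.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (roughly) equal size."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max of its bin)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```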


    Regression

    Data can be smoothed by fitting the data to a function, such as with regression.

    (Figure: data points (x, y) fitted by the regression line y = x + 1.)


    Cluster Analysis

    Outliers may be detected by clustering, where similar values are organized into

    groups, or clusters.

    Values that fall outside of the set of clusters may be considered outliers.


    Data integration combines data from multiple sources into a coherent data store.

    These sources may include multiple databases, data cubes, or flat files.


    Data integration issues:

    Entity identification problem:

    Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

    Detecting and resolving data value conflicts:

    For the same real-world entity, attribute values from different sources may differ

    Possible reasons: different representations, different scales, e.g., metric vs. British units


    Data integration issues (contd.):

    Schema integration and object matching can be tricky

    Entity identification problem: how can equivalent real-world entities from multiple data stores be matched up?

    Redundancy

    An attribute may be redundant if it can be derived from another attribute or set of attributes

    Some redundancies can be detected by correlation analysis

    Duplication

    Detection and resolution of data value conflicts


    Why Does Redundancy Cause Problems?

    Consider the following tables:

    EMP(ENO, ENAME, BASIC, DA, PAY) -> PAY = BASIC + DA

    EMPLOYEE(ENO, ENAME, BASIC, DA, PF, PAY) -> PAY = BASIC + DA - PF

    In the same way, ITEM-PRICE will be determined by local taxes, which vary from area to area.

    If redundant variables are numeric, it is better to normalize them first, before integrating data from multiple sources.


    Handling Redundancy in Data Integration

    Redundant data occur often when integrating multiple databases:

    Object identification: the same attribute or object may have different names in different databases

    Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue

    Redundant attributes can often be detected by correlation analysis.

    Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.


    Correlation Analysis (Numerical Data)

    Correlation coefficient (also called Pearson's product-moment coefficient):

    r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

    where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.

    If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.

    r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated.
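    A minimal sketch of the correlation coefficient above; NumPy and the two made-up attribute arrays are assumptions for the example.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson's product-moment correlation coefficient r_{A,B}."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    # sample standard deviations (divide by n - 1), matching the formula above
    sigma_a, sigma_b = a.std(ddof=1), b.std(ddof=1)
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * sigma_a * sigma_b)

# Example: two positively correlated attributes (made-up values).
A = [2, 4, 6, 8, 10]
B = [1, 3, 7, 9, 12]
print(pearson_r(A, B))          # close to +1
print(np.corrcoef(A, B)[0, 1])  # the same value from NumPy's built-in
```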


    Correlation Analysis (Categorical Data)

    χ² (chi-square) test:

    \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}

    The larger the χ² value, the more likely the variables are related.

    The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

    Correlation does not imply causality:

    the number of hospitals and the number of car thefts in a city are correlated;

    both are causally linked to a third variable: population.


    Chi-Square Calculation: An Example

                                 Play chess    Not play chess    Sum (row)
    Like science fiction          250 (90)        200 (360)          450
    Not like science fiction       50 (210)      1000 (840)         1050
    Sum (col.)                    300            1200               1500

    χ² (chi-square) calculation (the numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):

    \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

    This shows that like_science_fiction and play_chess are correlated in the group.
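    A minimal sketch that reproduces the χ² statistic for the contingency table above; NumPy is used for the hand computation, and scipy.stats.chi2_contingency is assumed to be available for the cross-check.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above
# (rows: like / not like science fiction; columns: play / not play chess).
observed = np.array([[250, 200],
                     [50, 1000]])

# Expected counts under independence: (row total * column total) / grand total.
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()   # [[90, 360], [210, 840]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)  # about 507.93

# The same statistic from SciPy (correction=False disables Yates' continuity correction).
chi2_scipy, p_value, dof, exp = chi2_contingency(observed, correction=False)
print(chi2_scipy, p_value)
```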


    Here, the data are transformed or consolidated into forms appropriate for mining.


    Data transformation can involve the following:

    Smoothing: remove noise from the data through binning, regression, and clustering

    Aggregation: summarization, data cube construction

    Generalization: concept hierarchy climbing

    Normalization: scaled to fall within a small, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling

    Attribute/feature construction

    New attributes constructed from the given ones


    Data Transformation: Normalization

    Min-max normalization: to [new_min_A, new_max_A]

    v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

    Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

    \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716

    Z-score normalization (\mu_A: mean, \sigma_A: standard deviation):

    v' = \frac{v - \mu_A}{\sigma_A}

    Ex. Let \mu_A = 54,000 and \sigma_A = 16,000. Then

    \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

    Normalization by decimal scaling:

    v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
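    A minimal sketch of the three normalizations above; NumPy, the helper names, and the made-up values used for decimal scaling are assumptions for the example.

```python
import numpy as np

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1.

    Assumes the largest magnitude in the data is at least 1.
    """
    values = np.asarray(values, float)
    j = len(str(int(np.abs(values).max())))  # number of digits of the largest magnitude
    return values / 10 ** j

print(min_max(73_600, 12_000, 98_000))    # about 0.716
print(z_score(73_600, 54_000, 16_000))    # 1.225
print(decimal_scaling([-986, 120, 917]))  # [-0.986  0.12   0.917]
```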


    Why data reduction?

    A database/data warehouse may store terabytes of data

    Complex data analysis/mining may take a very long time to run on

    the complete data set


    Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

    That is, mining on the reduced data set should be more efficient yet produce the same analytical results.


    Data reduction strategies:

    1. Data cube aggregation

    2. Attribute subset selection

    3. Data compression or dimensionality reduction

    4. Numerosity reduction, e.g., fitting the data into models


    1. Data Cube Aggregation

    Data cubes store multidimensional aggregated information.

    The cube created at the lowest level of abstraction is referred to as the base cuboid.

    The base cuboid should correspond to an individual entity of interest, such as sales or customers. The lowest level should be useful for analysis.

    A cube at the highest level of abstraction is the apex cuboid.


    Data cubes created for varying levels of abstraction are referred to as cuboids.

    Each higher level of abstraction further reduces the resulting data size.

    When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.


    2. Attribute Subset Selection

    Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes or dimensions.

    The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

    The best attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another.


    Basic heuristic methods of attribute subset selection include the following techniques (a sketch of the first one follows the list):

    1. Stepwise forward selection

    2. Stepwise backward elimination

    3. Combination of forward selection and backward elimination

    4. Decision tree induction
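    A minimal sketch of stepwise forward selection; the generic score(attributes) callable (for example, cross-validated accuracy of some classifier) and the greedy stopping rule are assumptions for the example, not the slides' exact procedure.

```python
def forward_selection(all_attrs, score, max_attrs=None):
    """Greedy stepwise forward selection.

    all_attrs : candidate attribute names
    score     : callable mapping a list of attributes to a number to maximize
                (e.g., cross-validated accuracy); supplied by the caller
    """
    remaining = list(all_attrs)
    selected, best_score = [], float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Score every single-attribute extension of the current set.
        scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(scores, key=scores.get)
        if scores[best_attr] <= best_score:
            break  # no remaining attribute improves the score; stop
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = scores[best_attr]
    return selected
```

    Stepwise backward elimination works analogously, starting from the full attribute set and repeatedly removing the attribute whose removal hurts the score least.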


    3. Data Compression or Dimensionality Reduction

    Here, data encodings or transformations are applied so as to obtain a reduced or compressed representation of the original data.

    Lossy compression

    Lossless compression

    Two lossy compression techniques:

    1. Wavelet transforms

    Discrete wavelet transform

    Discrete Fourier transform

    Hierarchical pyramid algorithm

    2. Principal component analysis (PCA)


    Examples:

    String compression

    There are extensive theories and well-tuned algorithms

    Typically lossless

    But only limited manipulation is possible without expansion

    Audio/video compression

    Typically lossy compression, with progressive refinement

    Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

    Time sequences (not audio)

    Typically short, and vary slowly with time


    Dimensionality Reduction: Wavelet Transformation

    Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis

    Compressed approximation: store only a small fraction of the strongest wavelet coefficients

    Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space

    Method:

    The length, L, must be an integer power of 2 (pad with 0s when necessary)

    Each transform has 2 functions: smoothing and difference

    They are applied to pairs of data, resulting in two sets of data of length L/2

    The two functions are applied recursively until the desired length is reached

    (Figure: example wavelet families, Haar-2 and Daubechies-4.)


    Dimensionality Reduction: Principal Component Analysis (PCA)

    Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.

    Steps:

    Normalize the input data: each attribute falls within the same range

    Compute k orthonormal (unit) vectors, i.e., the principal components

    Each input data vector is a linear combination of the k principal component vectors

    The principal components are sorted in order of decreasing significance or strength

    Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)

    Works for numeric data only

    Used when the number of dimensions is large
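    A minimal sketch of the PCA steps above using NumPy only; the 2-D toy data and the choice of k = 1 are assumptions for the example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k strongest principal components."""
    X = np.asarray(X, float)
    X_centered = X - X.mean(axis=0)          # normalize: zero-mean each attribute
    cov = np.cov(X_centered, rowvar=False)   # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]        # sort components by decreasing variance
    components = eigvecs[:, order[:k]]       # keep the k strongest (unit) vectors
    return X_centered @ components, components

# Toy 2-D data that mostly varies along one direction (made-up values).
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]]
scores, components = pca(X, k=1)
print(components)  # the single strongest principal component (a unit vector)
print(scores)      # each row of X represented by one number instead of two
```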


    Principal Component Analysis

    (Figure: data in the original axes X1 and X2, with the principal components Y1 and Y2 as new axes.)


    4. Numerosity Reduction

    Numerosity reduction techniques can be applied to reduce the data volume by choosing alternative, smaller forms of data representation.

    These techniques may be parametric or nonparametric.

    For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data. Outliers may also be stored.

    Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.


    Data Reduction Method (1): Regression and Log-Linear Models

    Linear regression: the data are modeled to fit a straight line

    Often uses the least-squares method to fit the line

    Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

    Log-linear model: approximates discrete multidimensional probability distributions


    Regression Analysis and Log-Linear Models

    Linear regression: Y = w X + b

    The two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand

    Using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ...

    Multiple regression: Y = b0 + b1 X1 + b2 X2

    Many nonlinear functions can be transformed into the above

    Log-linear models:

    The multi-way table of joint probabilities is approximated by a product of lower-order tables

    Probability: p(a, b, c, d) = α_{ab} β_{ac} χ_{ad} δ_{bcd}
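    A minimal sketch of fitting Y = wX + b with the least-squares criterion; NumPy and the made-up x/y values are assumptions for the example.

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with a little noise (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates of the two regression coefficients w and b.
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(w, b)  # close to 2 and 1

# For numerosity reduction, only (w, b) need to be stored instead of all (x, y)
# pairs; approximate values are reconstructed on demand:
y_hat = w * x + b
print(np.round(y_hat, 2))
```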


    Data Reduction Method (2): Histograms

    Divide the data into buckets and store the average (sum) for each bucket.

    Partitioning rules:

    Equal-width: equal bucket range

    Equal-frequency (or equal-depth): equal number of values per bucket

    V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents)

    MaxDiff: set a bucket boundary between each pair of adjacent values among the pairs with the β − 1 largest differences, where β is the number of buckets

    (Figure: an equal-width histogram with buckets spanning values from 10,000 to 90,000.)


    Data Reduction Method (3): Clustering

    Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

    Can be very effective if the data is clustered, but not if the data is smeared

    Can use hierarchical clustering and be stored in multi-dimensional index tree structures

    There are many choices of clustering definitions and clustering algorithms


    Data Reduction Method (4): Sampling

    Sampling: obtaining a small sample s to represent the whole data set N

    Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

    Choose a representative subset of the data:

    Simple random sampling may have very poor performance in the presence of skew

    Develop adaptive sampling methods

    Stratified sampling:

    Approximate the percentage of each class (or subpopulation of interest) in the overall database

    Used in conjunction with skewed data

    Note: sampling may not reduce database I/Os (data is read a page at a time)
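    A minimal sketch of simple random sampling (with and without replacement) and stratified sampling; pandas, the toy class column, and the 50% sampling fraction are assumptions for the example.

```python
import pandas as pd

# Toy data set with a skewed class distribution (made-up values).
df = pd.DataFrame({
    "id": range(10),
    "cls": ["A"] * 8 + ["B"] * 2,
})

# Simple random sample without replacement (SRSWOR) and with replacement (SRSWR).
srswor = df.sample(frac=0.5, replace=False, random_state=0)
srswr = df.sample(frac=0.5, replace=True, random_state=0)

# Stratified sample: keep roughly the same class percentages as the full data.
stratified = df.groupby("cls", group_keys=False).sample(frac=0.5, random_state=0)
print(stratified["cls"].value_counts())  # 4 of "A", 1 of "B"
```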


    Sampling: With or Without Replacement

    (Figure: samples drawn from the raw data with and without replacement.)


    Sampling: Cluster or Stratified Sampling

    (Figure: raw data on the left; a cluster/stratified sample on the right.)


    Three types of attributes:

    Nominal: values from an unordered set, e.g., color, profession

    Ordinal: values from an ordered set, e.g., military or academic rank

    Continuous: real numbers, e.g., integer or real values

    Discretization:

    Divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes

    Reduces the data size

    Prepares the data for further analysis


    Discretization

    Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

    Interval labels can then be used to replace actual data values

    Supervised vs. unsupervised

    Split (top-down) vs. merge (bottom-up)

    Discretization can be performed recursively on an attribute

    Concept hierarchy formation:

    Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)


    Discretization techniques can be categorized based on how the discretization is performed.

    1. Supervised discretization: the process uses the class information.

    2. Unsupervised discretization: the process does not use the class information.

    Top-down discretization or splitting: the process starts by first finding one or a few points (split or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.

    Bottom-up discretization or merging: the process starts with the continuous values as candidate split points and merges neighboring intervals.

    Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.

    Concept hierarchies are useful for mining at multiple levels of abstraction.


    Typical methods (all of the methods can be applied recursively):

    Binning (covered above): top-down split, unsupervised

    Histogram analysis (covered above): top-down split, unsupervised

    Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised

    Entropy-based discretization: supervised, top-down split

    Interval merging by χ² analysis: supervised, bottom-up merge

    Segmentation by natural partitioning: top-down split, unsupervised


    Entropy-Based Discretization

    Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)

    Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

    where p_i is the probability of class i in S1.

    The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

    The process is recursively applied to the partitions obtained until some stopping criterion is met.

    Such a boundary may reduce data size and improve classification accuracy.
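    A minimal sketch of a single entropy-based split, choosing the boundary T that minimizes I(S, T) as defined above; the made-up values and labels are assumptions, and the recursion and stopping criterion are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a set of class labels: -sum_i p_i log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing I(S, T) = |S1|/|S| H(S1) + |S2|/|S| H(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # candidate boundaries lie between distinct adjacent values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

# Made-up attribute values with two classes; small values are mostly class 0.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(best_split(values, labels))  # boundary 3.5 separates the two classes (entropy 0)
```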


    Interval Merge by χ² Analysis

    Merging-based (bottom-up) vs. splitting-based methods

    Merge: find the best neighboring intervals and merge them to form larger intervals, recursively

    ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]

    Initially, each distinct value of a numerical attribute A is considered to be one interval

    χ² tests are performed for every pair of adjacent intervals

    Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions

    This merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, max-interval, max inconsistency, etc.)


    Segmentation by Natural Partitioning

    A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals (a sketch follows the list):

    If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals

    If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

    If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
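    A minimal sketch of one level of the 3-4-5 rule; it is a simplification under stated assumptions (the full rule typically works on trimmed percentiles and recurses on each interval), with the bounds rounded at the most significant digit and made-up values in the example.

```python
import math

def three_four_five(low, high):
    """Partition [low, high] into 3, 4, or 5 equal-width intervals (one level only)."""
    msd = 10 ** math.floor(math.log10(high - low))   # unit of the most significant digit
    lo = math.floor(low / msd) * msd                  # round the bounds outward
    hi = math.ceil(high / msd) * msd
    distinct = round((hi - lo) / msd)                 # distinct values at that digit
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                                             # 1, 5, or 10 distinct values
        k = 5
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# Example: the rounded range covers 6 distinct values at the most significant
# digit, so it is split into 3 equal-width intervals (made-up bounds).
print(three_four_five(-351_976, 4_700_896))
# [(-1000000.0, 1000000.0), (1000000.0, 3000000.0), (3000000.0, 5000000.0)]
```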


    A concept hierarchy for a numerical attribute defines a discretization of the attribute.

    Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for age) with higher-level concepts (such as youth, middle-aged, or senior).

    The high-level concepts are useful for data generalization.

    The discretization techniques and concept hierarchies are applied before data mining as a preprocessing step, rather than during mining.


    Concept Hierarchy Generation for Categorical Data

    Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

    street < city < state < country

    Specification of a hierarchy for a set of values by explicit data grouping

    {Urbana, Champaign, Chicago}