DM Lecture 4


Transcript of DM Lecture 4

  • Slide 1/24

    Lecture 4 Data Pre-processing

    Fall 2010

    Dr. Tariq Mahmood, NUCES (FAST), KHI


  • Slide 2/24

    Reduce data volume by choosing alternative, smaller forms of data representation.

    Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Example: log-linear models obtain the value at a point in m-D space as a product over appropriate marginal subspaces.

    Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.

  • Slide 3/24

    Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line.

    Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.

    Log-linear model: approximates discrete multidimensional probability distributions.

  • Slide 4/24

    Linear regression: Y = w X + b. The two regression coefficients, w and b, specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....

    Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above.

    Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g. p(a, b, c, d) = E_ab · F_ac · G_ad · H_bcd, where each factor is a lower-order marginal table.
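
    A minimal sketch of the least-squares fit described above, assuming NumPy and made-up sample values; stacking more feature columns into A gives multiple regression with the same call:

    import numpy as np

    # Hypothetical 1-D data: after fitting, only the parameters (w, b) need to be stored.
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

    # Least-squares fit of Y = w*X + b: add a column of ones for the intercept b.
    A = np.column_stack([X, np.ones_like(X)])
    (w, b), *rest = np.linalg.lstsq(A, Y, rcond=None)

    print(f"Y = {w:.3f} * X + {b:.3f}")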

  • Slide 5/24

    Divide data into buckets and store the average (sum) for each bucket.

    Partitioning rules:

    Equal-width: equal bucket range

    Equal-frequency (or equal-depth)

    V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents)

    MaxDiff: set bucket boundaries between the pairs of adjacent values having the β-1 largest differences (for β buckets)
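
    A short sketch contrasting equal-width and equal-frequency bucketing, assuming NumPy; the bucket count and values are invented for illustration:

    import numpy as np

    values = np.array([5, 7, 8, 9, 12, 14, 14, 15, 18, 21, 21, 25, 28, 30])
    n_buckets = 4

    # Equal-width: every bucket spans the same value range.
    width_edges = np.linspace(values.min(), values.max(), n_buckets + 1)

    # Equal-frequency (equal-depth): every bucket holds roughly the same number of values.
    depth_edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1))

    for name, edges in [("equal-width", width_edges), ("equal-depth", depth_edges)]:
        counts, _ = np.histogram(values, bins=edges)
        print(name, edges.round(1), counts)   # store only edges plus per-bucket counts/averages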

  • Slide 6/24

    Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only.

    Can be very effective if the data is clustered, but not if the data is smeared.

    Can use hierarchical clustering and be stored in multi-dimensional index tree structures.

    There are many choices of clustering definitions and clustering algorithms.

    Cluster analysis will be studied in depth in Chapter 7.
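
    A sketch of cluster-based reduction, assuming scikit-learn's KMeans and synthetic data; only a centroid and a rough cluster size are kept, which illustrates the idea rather than any specific algorithm from the slide:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 2))          # hypothetical 2-D data set

    k = 5
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)

    # Keep only a compact representation per cluster: centroid plus radius
    # (max distance of a member to the centroid, a simple stand-in for diameter).
    for c in range(k):
        members = data[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = np.linalg.norm(members - centroid, axis=1).max()
        print(c, centroid.round(2), round(float(radius), 2))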

  • Slide 7/24

    Sampling: obtaining a small sample s to represent the whole data set N.

    Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data.

    Choose a representative subset of the data: simple random sampling may have very poor performance in the presence of skew.

    Develop adaptive sampling methods.

    Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
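
    A sketch of the sampling flavours mentioned above, assuming pandas (1.1+ for GroupBy.sample) and a made-up, skewed two-class data set:

    import pandas as pd

    # Hypothetical skewed data: class "rare" is a small subpopulation.
    df = pd.DataFrame({"value": range(1000),
                       "cls": ["common"] * 950 + ["rare"] * 50})

    srswor = df.sample(n=100, replace=False, random_state=0)   # simple random sample without replacement
    srswr = df.sample(n=100, replace=True, random_state=0)     # simple random sample with replacement

    # Stratified sample: keep each class's share of the data (10% per class here).
    stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)

    print(srswor["cls"].value_counts())
    print(stratified["cls"].value_counts())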

  • Slide 8/24

    Sampling: With or Without Replacement

    (Figure: samples drawn with and without replacement from the raw data.)

  • Slide 9/24

    (Figure: raw data vs. cluster/stratified sample.)

  • Slide 10/24

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary

  • Slide 11/24

    Three types of attributes:

    Nominal: values from an unordered set, e.g., color, profession

    Ordinal: values from an ordered set, e.g., military or academic rank

    Continuous: real numbers, e.g., integer or real numbers

    Discretization:

    Divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes

    Reduce data size by discretization

    Prepare for further analysis

  • Slide 12/24

    Discretization

    Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

    Interval labels can then be used to replace actual data values

    Supervised vs. unsupervised

    Split (top-down) vs. merge (bottom-up)

    Discretization can be performed recursively on an attribute

    Concept hierarchy formation

    Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
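
    A small sketch of replacing numeric ages by higher-level concept labels, assuming pandas; the cut points 29 and 59 are invented for illustration:

    import pandas as pd

    ages = pd.Series([13, 22, 25, 34, 41, 58, 63, 70, 85])

    # Interval labels replace the actual data values.
    concepts = pd.cut(ages, bins=[0, 29, 59, 120],
                      labels=["young", "middle-aged", "senior"])
    print(pd.concat({"age": ages, "concept": concepts}, axis=1))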

  • Slide 13/24

    Typical methods: All the methods can be applied recursively

    Binning (covered above)

    Top-down split, unsupervised

    Histogram analysis (covered above)

    Top-down split, unsupervised

    Clustering analysis (covered above)

    Either top-down split or bottom-up merge, unsupervised

    Entropy-based discretization: supervised, top-down split

    Interval merging by χ2 analysis: unsupervised, bottom-up merge

    Segmentation by natural partitioning: top-down split, unsupervised

  • Slide 14/24

    Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information gain after partitioning is

    I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)

    Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    Entropy(S1) = - Σ (i = 1..m) p_i log2(p_i)

    where p_i is the probability of class i in S1.

    The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met. Such a boundary may reduce data size and improve classification accuracy.
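
    A minimal sketch of the boundary search described above, in plain Python with invented values and class labels; it evaluates every midpoint between adjacent values and keeps the one with the smallest weighted entropy I(S, T):

    import math
    from collections import Counter

    def entropy(labels):
        # -sum over classes of p_i * log2(p_i)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_boundary(values, labels):
        pairs = sorted(zip(values, labels))
        best_t, best_i = None, float("inf")
        for k in range(1, len(pairs)):
            t = (pairs[k - 1][0] + pairs[k][0]) / 2        # candidate boundary
            s1 = [y for x, y in pairs if x <= t]
            s2 = [y for x, y in pairs if x > t]
            i = (len(s1) / len(pairs)) * entropy(s1) + (len(s2) / len(pairs)) * entropy(s2)
            if i < best_i:
                best_t, best_i = t, i
        return best_t, best_i

    vals = [1, 2, 3, 10, 11, 12]
    cls = ["A", "A", "A", "B", "B", "B"]
    print(best_boundary(vals, cls))   # boundary 6.5 gives I(S, T) = 0.0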

  • Slide 15/24

    Merging-based (bottom-up) vs. splitting-based methods

    Merge: find the best neighboring intervals and merge them to form larger intervals, recursively.

    ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]

    Initially, each distinct value of a numerical attribute A is considered to be one interval.

    χ2 tests are performed for every pair of adjacent intervals.

    Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions.

    This merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level).
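
    A sketch of the χ2 test on one pair of adjacent intervals, assuming SciPy and invented per-class counts; ChiMerge itself just repeats this for every adjacent pair and merges the pair with the smallest statistic:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: two adjacent intervals; columns: class counts within each interval.
    adjacent = np.array([[10, 2, 1],
                         [9, 3, 1]])

    chi2, p, dof, expected = chi2_contingency(adjacent)
    print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
    # A small chi2 (high p) means similar class distributions, so the pair is a
    # good candidate for merging; merging stops at a chosen significance level.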

  • Slide 16/24

    A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:

    If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.

    If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.

    If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
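
    A simplified, one-level sketch of the rule (it ignores the special grouping for 7 and the boundary adjustments shown on the next slide); low and high are assumed to be percentile-based endpoints as in the example that follows:

    import math

    def three_four_five(low, high):
        msd = 10 ** math.floor(math.log10(high - low))   # most-significant-digit unit
        lo = math.floor(low / msd) * msd                 # round the range out to msd boundaries
        hi = math.ceil(high / msd) * msd
        distinct = round((hi - lo) / msd)                # distinct msd values covered
        if distinct in (3, 6, 7, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        else:                                            # 1, 5, or 10
            parts = 5
        width = (hi - lo) / parts
        return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

    # Rounded profit range from the next slide: low = -$1,000, high = $2,000.
    print(three_four_five(-1000, 2000))   # three $1,000-wide intervals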

  • Slide 17/24

    Example of the 3-4-5 rule applied to a profit attribute:

    Step 1: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.

    Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High is rounded up to $2,000.

    Step 3: the range (-$1,000 to $2,000) covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000).

    Step 4: the boundary intervals are adjusted to cover the actual extremes, giving (-$400 to $5,000), and each interval is partitioned further:

    (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)

    (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)

    ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)

    ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)

  • Slide 18/24

    Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

    street < city < state < country

    Specification of a hierarchy for a set of values by explicit data grouping

    {Urbana, Champaign, Chicago} < Illinois

    Specification of only a partial set of attributes

    E.g., only street < city, not others

    Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

    E.g., for a set of attributes: {street, city, state, country}
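
    A tiny sketch of an explicitly specified grouping ({Urbana, Champaign, Chicago} < Illinois) as a plain mapping; climbing the hierarchy one level is just a lookup:

    city_to_state = {"Urbana": "Illinois",
                     "Champaign": "Illinois",
                     "Chicago": "Illinois"}

    cities = ["Chicago", "Urbana", "Chicago"]
    print([city_to_state[c] for c in cities])   # ['Illinois', 'Illinois', 'Illinois']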

  • Slide 19/24

    Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.

    The attribute with the most distinct values is placed at the lowest level of the hierarchy.

    Exceptions, e.g., weekday, month, quarter, year.

    country: 15 distinct values
    province_or_state: 365 distinct values
    city: 3567 distinct values
    street: 674,339 distinct values
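
    A sketch of the automatic ordering, assuming pandas and a tiny invented location table; attributes are simply sorted by their number of distinct values:

    import pandas as pd

    df = pd.DataFrame({
        "country": ["USA", "USA", "USA", "USA", "Canada", "Canada"],
        "province_or_state": ["IL", "IL", "IL", "NY", "ON", "BC"],
        "city": ["Chicago", "Chicago", "Urbana", "NYC", "Toronto", "Vancouver"],
        "street": ["Main St", "Oak St", "Green St", "5th Ave", "King St", "Pine St"],
    })

    # Fewest distinct values -> top of the hierarchy; most distinct -> bottom.
    hierarchy = df.nunique().sort_values().index.tolist()
    print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country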

  • Slide 20/24

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary

  • Slide 21/24

    Data preparation or preprocessing is a big issue for both data warehousing and data mining

    Descriptive data summarization is needed for quality data preprocessing.

    Data preparation includes

    Data cleaning and data integration

    Data reduction and feature selection

    Discretization

    A lot of methods have been developed, but data preprocessing is still an active area of research.


  • Slide 24/24

    1. What is meant by symmetric and skewed data? [5]

    2. Describe techniques for smoothing out data. [10]

    3. Why is it important to carry out descriptive data summarization? Justify your response through a fictitious quantile-quantile plot. [5]

    4. Why is it necessary to carry out correlation analysis? [5]

    5. Describe data cube aggregation and its advantages. [5]

    6. Can you suggest some change(s) to the state-of-the-art data pre-processing activity? [10]
