Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data...
-
Upload
lenard-sutton -
Category
Documents
-
view
212 -
download
0
Transcript of Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data...
Data Reduction Strategies
Why data reduction?A database/data warehouse may store terabytes of dataComplex data analysis/mining may take a very long time to
run on the complete data set
Data reduction Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same) analytical results
Data reduction strategiesAggregationSamplingDimensionality ReductionFeature subset selectionFeature creationDiscretization (already covered specially) and BinarizationAttribute Transformation
Data Reduction : AggregationCombining two or more attributes (or
objects) into a single attribute (or object)
PurposeData reduction
Reduce the number of attributes or objectsChange of scale
Cities aggregated into regions, states, countries, etc
More “stable” data Aggregated data tends to have less variability
Data Reduction : Aggregation
Standard Deviation of Average Monthly Precipitation
Standard Deviation of Average Yearly Precipitation
Variation of Precipitation in Australia
Data Reduction : Sampling
Sampling is the main technique employed for data selection. It is often used for both the preliminary
investigation of the data and the final data analysis.
Statisticians sample because obtaining the
entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because
processing the entire set of data of interest is too expensive or time consuming.
Data Reduction : Types of SamplingSimple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacementAs each item is selected, it is removed from
the population
Sampling with replacementObjects are not removed from the population
as they are selected for the sample. In sampling with replacement, the same object can be
picked up more than once
Sampling MethodAllow a mining algorithm to run in complexity that
is potentially sub-linear to the size of the dataChoose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methodsStratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed dataSampling may not reduce database I/Os (page at a
time).
Sampling
SRSWOR
(simple random
sample without
replacement)
SRSWR
Raw Data
Sampling
Raw Data Cluster/Stratified Sample
Data Reduction Feature Subset SelectionAnother way to reduce dimensionality of
data
Redundant features duplicate much or all of the information
contained in one or more other attributesExample: purchase price of a product and the
amount of sales tax paid
Irrelevant featurescontain no information that is useful for the
data mining task at handExample: students' ID is often irrelevant to
the task of predicting students' GPA