Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data...

Data Reduction Strategies

Why data reduction?A database/data warehouse may store terabytes of dataComplex data analysis/mining may take a very long time to

run on the complete data set

Data reduction Obtain a reduced representation of the data set that is much

smaller in volume but yet produce the same (or almost the same) analytical results

Data reduction strategiesAggregationSamplingDimensionality ReductionFeature subset selectionFeature creationDiscretization (already covered specially) and BinarizationAttribute Transformation

Data Reduction : AggregationCombining two or more attributes (or

objects) into a single attribute (or object)

PurposeData reduction

Reduce the number of attributes or objectsChange of scale

Cities aggregated into regions, states, countries, etc

More “stable” data Aggregated data tends to have less variability

Data Reduction : Aggregation

Standard Deviation of Average Monthly Precipitation

Standard Deviation of Average Yearly Precipitation

Variation of Precipitation in Australia

Data Reduction : Sampling

Sampling is the main technique employed for data selection. It is often used for both the preliminary

investigation of the data and the final data analysis.

Statisticians sample because obtaining the

entire set of data of interest is too expensive or time consuming.

Sampling is used in data mining because

processing the entire set of data of interest is too expensive or time consuming.

Data Reduction : Types of SamplingSimple Random Sampling

There is an equal probability of selecting any particular item

Sampling without replacementAs each item is selected, it is removed from

the population

Sampling with replacementObjects are not removed from the population

as they are selected for the sample. In sampling with replacement, the same object can be

picked up more than once

Sampling MethodAllow a mining algorithm to run in complexity that

is potentially sub-linear to the size of the dataChoose a representative subset of the data

Simple random sampling may have very poor performance in the presence of skew

Develop adaptive sampling methodsStratified sampling:

Approximate the percentage of each class (or subpopulation of interest) in the overall database

Used in conjunction with skewed dataSampling may not reduce database I/Os (page at a

time).

Sampling

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Sampling

Raw Data Cluster/Stratified Sample

Data Reduction Feature Subset SelectionAnother way to reduce dimensionality of

data

Redundant features duplicate much or all of the information

contained in one or more other attributesExample: purchase price of a product and the

amount of sales tax paid

Irrelevant featurescontain no information that is useful for the

data mining task at handExample: students' ID is often irrelevant to

the task of predicting students' GPA

Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data...

Documents

Transcript of Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data...