Data Preprocessing - baskent.edu.tr
Data Preprocessing
BİL477-2021-2022-FALL- INTRODUCTION TO DATA MINING DR. GÖKHAN MEMIŞ
GENEL- PUBLIC
Data Preprocessing: An Overview
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data Quality: Why Preprocess the Data?
Data has quality if it satisfies the requirements of its intended use.
Many factors comprise data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Major Tasks in Data Preprocessing
Data Cleaning
• Missing Values
• Noisy Data
How can you go about filling in the missing values?
• Regression analysis
• Mode, median, or mean imputation
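The fill-in strategies above (mode, median, mean) can be sketched in Python. This is a minimal illustration, not a library API: `impute` is a hypothetical helper name, and missing values are represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode
    of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]
```

For example, `impute([1, 2, None, 5], "median")` fills the gap with the median of the observed values 1, 2, 5, yielding `[1, 2, 2, 5]`.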
Noisy Data
• Binning Method
  • Equal-frequency binning: bins have an equal number of values.
  • Equal-width binning: bins have equal width; the bin ranges are defined as [min + w], [min + 2w], ..., [min + nw], where w = (max − min) / (number of bins).
• Regression
• Outlier Analysis
• Statistical Methods
Noisy Data - Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
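The binning example above can be reproduced in Python; the function names are illustrative sketches, not a standard API.

```python
def equal_frequency_bins(sorted_values, n_bins):
    # partition sorted data into bins holding equal numbers of values
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # replace every value in a bin by the bin mean
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace every value by the closer of the bin's min and max
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

`smooth_by_means(bins)` and `smooth_by_boundaries(bins)` give exactly the smoothed bins shown above.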
Data Integration
The Entity Identification Problem
Redundancy and Correlation Analysis
Tuple Duplication
The Entity Identification Problem
These sources may include multiple databases and data cubes.
How can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute?
GUIDs are created and stored as 128-bit (16-byte) values in the Active Directory, generated from the MAC address, the date and time of creation, and the hardware information of the producing system. They are usually displayed as 32 hexadecimal digits, grouped into fixed-length segments.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes in a different database.
Some redundancies can be detected by correlation analysis.
Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary with those of another.
χ² Correlation Test for Nominal Data
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828. Here the computed χ² value does not exceed that threshold, so these features are independent.
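A χ² statistic for a contingency table can be computed as below. The 2 × 2 counts are made-up illustration data, not the slide's table; with them the statistic stays well below the 10.828 threshold, matching the "independent" conclusion.

```python
def chi_square(observed):
    # observed: a contingency table as a list of rows of counts
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# hypothetical 2x2 table; degrees of freedom = (2-1)(2-1) = 1
stat = chi_square([[10, 20], [30, 40]])
# stat ≈ 0.794 < 10.828, so independence is not rejected at the 0.001 level
```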
χ² Degrees of Freedom Table
Remove duplicate tuples from list
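A minimal order-preserving de-duplication in Python (the function name is illustrative):

```python
def remove_duplicate_tuples(rows):
    # keep the first occurrence of each tuple, preserving order
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique
```

For example, `remove_duplicate_tuples([(1, "Ann"), (2, "Bob"), (1, "Ann")])` returns `[(1, "Ann"), (2, "Bob")]`.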
Data ReductionComplex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Wavelet Transforms
The most important feature of this method is that signals can be analyzed locally, whereby a large signal can be analyzed within a small region.
This analysis is carried out in the time domain, so both low-frequency information over long time intervals and high-frequency information over short time intervals can be captured.
Because of these advantages, the wavelet analysis method is used for time-series analysis in a large variety of fields, from the cylinder-pressure data of internal combustion engines to data on Parkinson's disease.
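As one concrete sketch (not taken from the slides), a single level of the Haar discrete wavelet transform splits a signal into pairwise averages (a coarse approximation, usable as a reduced representation) and pairwise differences (detail coefficients):

```python
def haar_step(signal):
    # one level of the Haar wavelet transform, for an even-length signal:
    # pairwise averages (approximation) and pairwise half-differences (detail)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

# each original pair is exactly recoverable as (a + d, a - d),
# so discarding small detail coefficients yields a lossy reduced representation
```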
Feature selection
Using feature selection techniques has many advantages:
1. Reduced training time.
2. Less complex models, which are easier to interpret.
3. Improved accuracy, if the right subset is chosen.
4. Reduced overfitting.
Feature selection using Relief
There are three algorithms in the Relief family:
1. Basic Relief algorithm: limited to classification problems with two classes.
2. ReliefF: an extension of Relief that can deal with multiclass problems.
3. RReliefF: ReliefF adapted for continuous-class (regression) problems.
Basic Relief Algorithm
Pseudo code:
1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance Rᵢ;
4.     find nearest hit H and nearest miss M;
5.     for A := 1 to a do
6.         W[A] := W[A] − diff(A, Rᵢ, H)/m + diff(A, Rᵢ, M)/m;
7. end;
Basic Relief Algorithm
Here, rows 1, 2, ..., 5 are the instances. D is the target class (with two classes, 0 and 1).
A, B, C are the features.
We will find the weights of the attributes and then select the 2 best features, i.e., the features with the highest weights.
Let m = 2 (i.e., we will perform 2 iterations).
Let all attribute weights initially be 0: W[A] = W[B] = W[C] = 0.
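The pseudocode above translates to Python roughly as follows, for nominal attributes (diff is 0 when two values match and 1 otherwise); the helper names are illustrative:

```python
import random

def diff(a, r1, r2):
    # nominal attribute difference: 0 if equal, 1 otherwise
    return 0.0 if r1[a] == r2[a] else 1.0

def dist(r1, r2):
    # distance between instances = sum of per-attribute differences
    return sum(diff(a, r1, r2) for a in range(len(r1)))

def relief(X, y, m, seed=42):
    rng = random.Random(seed)
    n_attrs = len(X[0])
    w = [0.0] * n_attrs
    for _ in range(m):
        i = rng.randrange(len(X))
        ri = X[i]
        # nearest hit: closest instance of the same class (excluding ri itself)
        hit = min((X[j] for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda r: dist(ri, r))
        # nearest miss: closest instance of a different class
        miss = min((X[j] for j in range(len(X)) if y[j] != y[i]),
                   key=lambda r: dist(ri, r))
        for a in range(n_attrs):
            w[a] += (diff(a, ri, miss) - diff(a, ri, hit)) / m
    return w
```

Features whose weight ends up high separate the classes well; feature selection then keeps the top-ranked features.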
Correlation-based Feature Selection (CFS)
Information Gain
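Information gain scores a feature by how much knowing its value reduces the entropy of the class: IG(D, A) = H(D) − Σᵥ (|Dᵥ|/|D|) · H(Dᵥ). A minimal sketch, with illustrative function names:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum p * log2(p) over class proportions p
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    # group the class labels by the feature's value, then subtract
    # the weighted entropy of the groups from the overall entropy
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that perfectly splits the classes has gain equal to the full class entropy; an uninformative feature has gain 0.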
Histograms
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
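An equal-width histogram of the price list can be built as follows; the choice of width 10 (buckets 1-10, 11-20, 21-30) is an assumption for illustration:

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def equal_width_histogram(values, width):
    # map each value to its bucket index, then report
    # (bucket_low, bucket_high) -> count for each occupied bucket
    counts = Counter((v - 1) // width for v in values)
    return {(k * width + 1, (k + 1) * width): counts[k] for k in sorted(counts)}

hist = equal_width_histogram(prices, 10)
# {(1, 10): 13, (11, 20): 25, (21, 30): 14}
```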
Data Transformation
In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (such as age) are replaced by interval labels (e.g., 0-10, 11-20, and so on) or conceptual labels (e.g., youth, adult, and senior ).
Data Transformation by Normalization
Min – Max Normalization
Annual Salary ($)    Normalized Value
89,986
40,849
42,061
17,175
4,229
85,926
56,223
92,268
21,742
1,765                0.00
3,268                0.02
98,048               1.00
97,382               0.99
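The normalized values in the table follow the standard min-max formula v′ = (v − min) / (max − min); rounding to two decimals reproduces the figures shown for the minimum, maximum, and nearby salaries:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # scale values linearly so that min(values) -> new_min and max(values) -> new_max
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# the four salaries whose normalized values the table shows
salaries = [1765, 3268, 98048, 97382]
normalized = [round(v, 2) for v in min_max_normalize(salaries)]
# [0.0, 0.02, 1.0, 0.99]
```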
Simple Moving Average Based Normalization

Year    Sales ($M)    MA
2003    4
2004    6
2005    5             6.4
2006    8
2007    9
2008    5
2009    4
2010    3
2011    7
2012    8
The moving average for the first five years (2003-2007) is calculated by adding the five sales totals and dividing by 5. This gives the moving average for 2005 (the center year): (4 + 6 + 5 + 8 + 9) / 5 = 6.4M.
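The calculation above can be sketched as a centered moving average; the function name is illustrative:

```python
def centered_moving_average(values, window):
    # out[i] is the mean of the window centered at index i;
    # None where the full window does not fit
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out

sales = [4, 6, 5, 8, 9, 5, 4, 3, 7, 8]  # 2003-2012
ma = centered_moving_average(sales, 5)
# ma[2], the 2005 entry, is 6.4
```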