Data Preprocessing - baskent.edu.tr
Data Preprocessing
BİL477-2021-2022-FALL- INTRODUCTION TO DATA MINING DR. GÖKHAN MEMIŞ
GENEL- PUBLIC
Data Preprocessing: An Overview
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data Quality: Why Preprocess the Data?
Data has quality if it satisfies the requirements of its intended use.
Many factors comprise data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Major Tasks in Data Preprocessing
Data Cleaning
• Missing Values
• Noisy Data
How can you go about filling in the missing values?
• Regression analysis
• Mode, median, or mean imputation
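The fill-in strategies above (mode, median, mean) can be sketched in Python. This is a minimal illustration, not a library API: `impute` is a hypothetical helper name, and missing values are represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode
    of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]
```

For example, `impute([1, 2, None, 5], "median")` fills the gap with the median of the observed values 1, 2, 5, yielding `[1, 2, 2, 5]`.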
Noisy Data
• Binning Method
  • Equal-frequency binning: bins have an equal number of values.
  • Equal-width binning: bins have equal width; the bin ranges are defined as [min + w], [min + 2w], ..., [min + nw], where w = (max − min) / (number of bins).
• Regression
• Outlier Analysis
• Statistical Methods
Noisy Data - Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
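The binning example above can be reproduced in Python; the function names are illustrative sketches, not a standard API.

```python
def equal_frequency_bins(sorted_values, n_bins):
    # partition sorted data into bins holding equal numbers of values
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # replace every value in a bin by the bin mean
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace every value by the closer of the bin's min and max
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

`smooth_by_means(bins)` and `smooth_by_boundaries(bins)` give exactly the smoothed bins shown above.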
Data Integration
The Entity Identification Problem
Redundancy and Correlation Analysis
Tuple Duplication
The Entity Identification Problem
These sources may include multiple databases and data cubes.
How can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute?
GUIDs are created and stored as 128-bit (16-byte) values in the Active Directory, generated from the MAC address, the date and time of creation, and the hardware information of the producing system. They are usually displayed as 32 hexadecimal digits, grouped into fixed-length segments.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes in a different database.
Some redundancies can be detected by correlation analysis.
Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary with those of another.
χ² Correlation Test for Nominal Data
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828. Here the computed χ² value does not exceed that threshold, so these features are independent.
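A χ² statistic for a contingency table can be computed as below. The 2 × 2 counts are made-up illustration data, not the slide's table; with them the statistic stays well below the 10.828 threshold, matching the "independent" conclusion.

```python
def chi_square(observed):
    # observed: a contingency table as a list of rows of counts
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# hypothetical 2x2 table; degrees of freedom = (2-1)(2-1) = 1
stat = chi_square([[10, 20], [30, 40]])
# stat ≈ 0.794 < 10.828, so independence is not rejected at the 0.001 level
```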
χ² Degrees of Freedom Table
Remove duplicate tuples from list
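A minimal order-preserving de-duplication in Python (the function name is illustrative):

```python
def remove_duplicate_tuples(rows):
    # keep the first occurrence of each tuple, preserving order
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique
```

For example, `remove_duplicate_tuples([(1, "Ann"), (2, "Bob"), (1, "Ann")])` returns `[(1, "Ann"), (2, "Bob")]`.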
Data ReductionComplex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Wavelet Transforms
The most important feature of this method is that signals can be analyzed locally, whereby a large signal can be analyzed within a small region.
This analysis is carried out in the time domain, so both low-frequency information over long time intervals and high-frequency information over short time intervals can be captured.
Because of these advantages, the wavelet analysis method is used for time-series analysis in a large variety of fields, from the cylinder-pressure data of internal combustion engines to data on Parkinson's disease.
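As one concrete sketch (not taken from the slides), a single level of the Haar discrete wavelet transform splits a signal into pairwise averages (a coarse approximation, usable as a reduced representation) and pairwise differences (detail coefficients):

```python
def haar_step(signal):
    # one level of the Haar wavelet transform, for an even-length signal:
    # pairwise averages (approximation) and pairwise half-differences (detail)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

# each original pair is exactly recoverable as (a + d, a - d),
# so discarding small detail coefficients yields a lossy reduced representation
```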
Feature selection
Using feature selection techniques has many advantages:
1. Reduced training time.
2. Less complex models, which are easier to interpret.
3. Improved accuracy, if the right subset is chosen.
4. Reduced overfitting.
Feature selection using Relief
There are three algorithms in the Relief family:
1. Basic Relief algorithm: limited to classification problems with two classes.
2. ReliefF: an extension of Relief that can deal with multiclass problems.
3. RReliefF: ReliefF adapted for continuous-class (regression) problems.
Basic Relief Algorithm
Pseudo code:
1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance Rᵢ;
4.     find nearest hit H and nearest miss M;
5.     for A := 1 to a do
6.         W[A] := W[A] − diff(A, Rᵢ, H)/m + diff(A, Rᵢ, M)/m;
7. end;
Basic Relief Algorithm
Here, rows 1, 2, ..., 5 are the instances. D is the target class (with two classes, 0 and 1).
A, B, C are the features.
We will find the weights of the attributes and then select the 2 best features, i.e., the features with the highest weights.
Let m = 2 (i.e., we will perform 2 iterations).
Let all attribute weights initially be 0: W[A] = W[B] = W[C] = 0.
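The pseudocode above translates to Python roughly as follows, for nominal attributes (diff is 0 when two values match and 1 otherwise); the helper names are illustrative:

```python
import random

def diff(a, r1, r2):
    # nominal attribute difference: 0 if equal, 1 otherwise
    return 0.0 if r1[a] == r2[a] else 1.0

def dist(r1, r2):
    # distance between instances = sum of per-attribute differences
    return sum(diff(a, r1, r2) for a in range(len(r1)))

def relief(X, y, m, seed=42):
    rng = random.Random(seed)
    n_attrs = len(X[0])
    w = [0.0] * n_attrs
    for _ in range(m):
        i = rng.randrange(len(X))
        ri = X[i]
        # nearest hit: closest instance of the same class (excluding ri itself)
        hit = min((X[j] for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda r: dist(ri, r))
        # nearest miss: closest instance of a different class
        miss = min((X[j] for j in range(len(X)) if y[j] != y[i]),
                   key=lambda r: dist(ri, r))
        for a in range(n_attrs):
            w[a] += (diff(a, ri, miss) - diff(a, ri, hit)) / m
    return w
```

Features whose weight ends up high separate the classes well; feature selection then keeps the top-ranked features.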
Correlation-based Feature Selection (CFS)
Information Gain
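Information gain scores a feature by how much knowing its value reduces the entropy of the class: IG(D, A) = H(D) − Σᵥ (|Dᵥ|/|D|) · H(Dᵥ). A minimal sketch, with illustrative function names:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum p * log2(p) over class proportions p
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    # group the class labels by the feature's value, then subtract
    # the weighted entropy of the groups from the overall entropy
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that perfectly splits the classes has gain equal to the full class entropy; an uninformative feature has gain 0.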
Histograms
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
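An equal-width histogram of the price list can be built as follows; the choice of width 10 (buckets 1-10, 11-20, 21-30) is an assumption for illustration:

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def equal_width_histogram(values, width):
    # map each value to its bucket index, then report
    # (bucket_low, bucket_high) -> count for each occupied bucket
    counts = Counter((v - 1) // width for v in values)
    return {(k * width + 1, (k + 1) * width): counts[k] for k in sorted(counts)}

hist = equal_width_histogram(prices, 10)
# {(1, 10): 13, (11, 20): 25, (21, 30): 14}
```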
Data Transformation
In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (such as age) are replaced by interval labels (e.g., 0-10, 11-20, and so on) or conceptual labels (e.g., youth, adult, and senior ).
Data Transformation by Normalization
Min – Max Normalization
Annual Salary ($)    Normalized Value
89,986
40,849
42,061
17,175
4,229
85,926
56,223
92,268
21,742
1,765                0.00
3,268                0.02
98,048               1.00
97,382               0.99
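The normalized values in the table follow the standard min-max formula v′ = (v − min) / (max − min); rounding to two decimals reproduces the figures shown for the minimum, maximum, and nearby salaries:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # scale values linearly so that min(values) -> new_min and max(values) -> new_max
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# the four salaries whose normalized values the table shows
salaries = [1765, 3268, 98048, 97382]
normalized = [round(v, 2) for v in min_max_normalize(salaries)]
# [0.0, 0.02, 1.0, 0.99]
```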
Simple Moving Average Based Normalization

Year    Sales ($M)    MA
2003    4
2004    6
2005    5             6.4
2006    8
2007    9
2008    5
2009    4
2010    3
2011    7
2012    8
The moving average for the first five years (2003-2007) is calculated by adding the five sales totals and dividing by 5. This gives the moving average for 2005 (the center year): (4 + 6 + 5 + 8 + 9) / 5 = 6.4M.
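The calculation above can be sketched as a centered moving average; the function name is illustrative:

```python
def centered_moving_average(values, window):
    # out[i] is the mean of the window centered at index i;
    # None where the full window does not fit
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out

sales = [4, 6, 5, 8, 9, 5, 4, 3, 7, 8]  # 2003-2012
ma = centered_moving_average(sales, 5)
# ma[2], the 2005 entry, is 6.4
```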