Data preprocessing

A Brief Presentation on Data Mining

Jason Rodrigues

Data Preprocessing

• Introduction

• Why data proprocessing?

• Data Cleaning

• Data Integration and Transformation

• Data Reduction

• Discretization and concept Heirarchy generation

• Takeaways

Agenda

Why Data Preprocessing?

Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes

of interest, or containing only aggregate data noisy: containing errors or outliers inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results! Quality decisions must be based on quality data Data warehouse needs consistent integration of quality data

A multi-dimensional measure of data quality A well-accepted multi-dimensional view: accuracy, completeness, consistency, timeliness, believability, value

added, interpretability, accessibilityBroad categories

intrinsic, contextual, representational, and accessibility

Data Preprocessing

Major Tasks of Data Preprocessing

Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and

resolve inconsistencies

Data integration Integration of multiple databases, data cubes, files, or notes

Data trasformation Normalization (scaling to a specific range) Aggregation

Data reduction Obtains reduced representation in volume but produces the same or

similar analytical results Data discretization: with particular importance, especially for numerical

data Data aggregation, dimensionality reduction, data compression,

generalization

Data Preprocessing

Major Tasks of Data Preprocessing

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Data Cleaning

Tasks of Data Cleaning Fill in missing values Identify outliers and smooth noisy data Correct inconsistent data

Data Cleaning

Manage Missing Data Ignore the tuple: usually done when class label is missing (assuming the

task is classification—not effective in certain cases) Fill in the missing value manually: tedious + infeasible? Use a global constant to fill in the missing value: e.g., “unknown”, a

new class?! Use the attribute mean to fill in the missing value Use the attribute mean for all samples of the same class to fill in

the missing value: smarter Use the most probable value to fill in the missing value: inference-

based such as regression, Bayesian formula, decision tree

Data Cleaning

Manage Noisy DataBinning Method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median,

smooth by bin boundaries, etcClustering: detect and remove outliersSemi Automated Computer and Manual InterventionRegression Use regression functions

Data Cleaning

Cluster Analysis

Data Cleaning

Regression Analysis

x

y

y = x + 1

X1

Y1

Y1’

•Linear regression (best line to fit two variables)

•Multiple linear regression (more than two variables, fit to a multidimensional surface

Data Cleaning

Inconsistant Data Manual correction using external

references Semi-automatic using various tools

To detect violation of known functional dependencies and data constraints

To correct redundant data

Data integration and transformation

Tasks of Data Integration and transformation Data integration:

combines data from multiple sources into a coherent store

Schema integration integrate metadata from different sources Entity identification problem: identify real world entities

from multiple data sources, e.g., A.cust-id B.cust-# Detecting and resolving data value conflicts

for the same real world entity, attribute values from different sources are different

possible reasons: different representations, different scales, e.g., metric vs. British units, different currency

Manage Data Integration


Redundant data occur often when integrating multiple DBs

The same attribute may have different names in different databases

One attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant data may be able to be detected by correlational analysis

• Careful integration can help reduce/avoid redundancies and

inconsistencies and improve mining speed and quality

BABA n

BBAAr

)1(

))((,

Manage Data Transformation


Smoothing: remove noise from data (binning, clustering, regression)

Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range

min-max normalization

z-score normalization

normalization by decimal scaling Attribute/feature construction

New attributes constructed from the given ones

Manage Data Reduction

Data reduction

Data reduction: reduced representation, while still retaining critical information

Data cube aggregation

Dimensionality reduction

Data compression

Numerosity reduction

Discretization and concept hierarchy generation

Data Cube Aggregation

Data reduction

Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with

Reference appropriate levels Use the smallest representation capable

to solve the task

Data Compression

Data reduction

String compression There are extensive theories and well-tuned algorithms Typically lossless But only limited manipulation is possible without expansion

Audio/video, image compression Typically lossy compression, with progressive refinement Sometimes small fragments of signal can be reconstructed

without reconstructing the whole Time sequence is not audio

Typically short and vary slowly with time

` `

Decision Tree

Data reduction

Similarities and Dissimilarities

Proximity Proximity is used to refer to Similarity or

Dissimilarity, since proximity between the object is a function of proximity between the corresponding attributes of two objects.

Similarity: Numeric measure of the degree to which the two objects are alike.

Dissimilarity: Numeric measure of the degree to which the two objects are different.

Dissimilarities between Data Objects? Similarity

Numerical measure of how alike two data objects are.

Is higher when objects are more alike. Often falls in the range [0,1]

Dissimilarity Numerical measure of how different are two data

objects Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies

Proximity refers to a similarity or dissimilarity

Euclidean Distance

Euclidean Distance

Where n is the number of dimensions (attributes)

and pk and qk are, respectively, the kth attributes (components) or data objects p and q.

Standardization is necessary, if scales differ.

n

kkk qpdist

1

2)(

Euclidean Distance

Euclidean Distance

n

kkk qpdist

1

2)(

1 2 3 4 5 6 7 8

0

1

2

3

4

5

6

0

5

4

3

Column 2

Minkowski Distance r = 1. City block (Manhattan, taxicab, L1 norm)

distance. A common example of this is the Hamming distance, which is just

the number of bits that are different between two binary vectors

r = 2. Euclidean distance

r . “supremum” (Lmax norm, L norm) distance. This is the maximum difference between any component of the

vectors Example: L_infinity of (1, 0, 2) and (6, 0, 3) = ??

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

Minkowski Distance

point x yp1 0 2p2 2 0p3 3 1p4 5 1

L1 p1 p2 p3 p4p1 0 4 4 6p2 4 0 2 4p3 4 2 0 2p4 6 4 2 0

L2 p1 p2 p3 p4p1 0 2.828 3.162 5.099p2 2.828 0 1.414 3.162p3 3.162 1.414 0 2p4 5.099 3.162 2 0

L p1 p2 p3 p4

p1 0 2 3 5p2 2 0 1 3p3 3 1 0 2p4 5 3 2 0

Euclidean Distance Properties

• Distances, such as the Euclidean distance, have some well known properties.

1. d(x, y) 0 for all x and y and d(x, y) = 0 only if x = y. (Positive definiteness)

2. d(x, y) = d(y, x) for all x and q. (Symmetry)3. d(x, y) d(x, y) + d(y, z) for all points x, y, and z.

(Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

• A distance that satisfies these properties is a metric, and a space is called a metric space

Non Metric Dissimilarities – Set Differences

non-metric measures are often robust(resistant to outliers, errors in objects, etc.)

the symmetry and mainly the triangular inequality are often violated

cannot be directly used with MAMs

a

b

a > b + c

c

ab

a ≠ b

Non Metric Dissimilarities – Time

various k-median distances measure distance between the two (k-th) most

similar portions in objects

COSIMIR back-propagation network with single output

neuron serving as a distance, allows training

Dynamic Time Warping distance sequence alignment technique minimizes the sum of distances between

sequence elements

fractional Lp distances generalization of Minkowski distances (p<1) more robust to extreme differences in coordinates

Jaccard Coeffificient

Recall: Jaccard coefficient is a commonly used measure of overlap of two sets A and B

jaccard(A,B) = |A ∩ B| / |A ∪ B|

jaccard(A,A) = 1

jaccard(A,B) = 0 if A ∩ B = 0

A and B don’t have to be the same size. JC always assigns a number between 0 and 1.

Takeaways

Why Data Preprocessing?

Data Cleaning

Data Integration and Transformation

Data Reduction

Discretization and concept Heirarchy generation

Data preprocessing

Technology

Transcript of Data preprocessing