
Data Preparation (Data preprocessing)

2

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction, Feature selection

• Discretization and concept hierarchy generation

3

Why Prepare Data?

• Some data preparation is needed for all mining tools

• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool

• The prediction error rate should be lower (or at least the same) after the preparation as before it

4

Why Prepare Data?

• Preparing data also prepares the miner: when using prepared data, the miner produces better models, faster

• GIGO (garbage in, garbage out): good data is a prerequisite for producing effective models of any type

• Some techniques are based on theoretical considerations, while others are rules of thumb based on experience

5

Why Prepare Data?

• Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

• e.g., occupation=“”

• noisy: containing errors or outliers

• e.g., Salary=“-10”, Age=“222”

• inconsistent: containing discrepancies in codes or names

• e.g., Age=“42”, Birthday=“03/07/1997”

• e.g., Was rating “1,2,3”, now rating “A, B, C”

• e.g., discrepancy between duplicate records

6

Why Is Data Dirty?

• Incomplete data comes from

• Value not available when the data was collected

• Different considerations between the time the data was collected and the time it is analyzed

• Human/hardware/software problems

• Noisy data comes from the process of data

• Collection

• Entry

• Transmission

• Inconsistent data comes from

• Different data sources

• Functional dependency violation

7

Why Is Data Preprocessing Important?

• Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse. —Bill Inmon

• Data preparation takes more than 80% of the data mining project time

• Data preparation requires knowledge of the “business”

8

Major Tasks in Data Preprocessing

• Data cleaning

• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

• Data integration

• Integration of multiple databases, data cubes, or files

• Data transformation

• Normalization and aggregation

• Data reduction

• Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results

• Data discretization

• Part of data reduction but with particular importance, especially for numerical data

9

Forms of data preprocessing

10

Data Preparation as a step in the Knowledge Discovery Process

(Figure: the knowledge discovery process; data from the DB/DW goes through Cleaning and Integration, then Selection and Transformation, then Data Mining, then Evaluation and Presentation, yielding Knowledge. The cleaning/integration and selection/transformation steps constitute data preprocessing.)

11

Types of Data Measurements

• Measurements differ in their nature and the amount of information they give

• Qualitative vs. Quantitative

12

Types of Measurements

• Nominal scale

• Gives unique names to objects; no other information deducible

• Names of people

13

Types of Measurements

• Nominal scale

• Categorical scale

• Names categories of objects

• Although possibly numerical, the values are not ordered

• ZIP codes

• Hair color

• Gender: Male, Female

• Marital Status: Single, Married, Divorcee, Widower

14

Types of Measurements

• Nominal scale

• Categorical scale

• Ordinal scale

• Measured values can be ordered naturally

• Transitivity: (A > B) and (B > C) ⇒ (A > C)

• “blind” tasting of wines

• Classifying students as: Very Good, Good, Sufficient, ...

• Temperature: Cool, Mild, Hot

15

Types of Measurements

• Nominal scale

• Categorical scale

• Ordinal scale

• Interval scale

• The scale has a means to indicate the distance that separates measured values

• temperature

16

Types of Measurements

• Nominal scale

• Categorical scale

• Ordinal scale

• Interval scale

• Ratio scale

• measurement values can be used to determine a meaningful ratio between them

• Bank account balance

• Weight

• Salary

17

Types of Measurements

• Nominal scale

• Categorical scale

• Ordinal scale

• Interval scale

• Ratio scale

(Figure annotation: the scales range from qualitative to quantitative, which may be discrete or continuous, with information content increasing from the nominal scale to the ratio scale.)

18

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation

19

Data Cleaning

• Data cleaning tasks

• Deal with missing values

• Identify outliers and smooth out noisy data

• Correct inconsistent data

20

Definitions

• Missing value - not captured in the data set: errors in feeding, transmission, ...

• Empty value - no value in the population

• Outlier - out-of-range value

21

Missing Data

• Data is not always available

• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

• Missing data may be due to

• equipment malfunction

• inconsistent with other recorded data and thus deleted

• data not entered due to misunderstanding

• certain data may not be considered important at the time of entry

• history or changes of the data not registered

• Missing data may need to be inferred.

• Missing values may carry some information content: e.g. a credit application may carry information by noting which field the applicant did not complete

22

Missing Values

• There are always MVs in a real dataset

• MVs may have an impact on modeling, in fact, they can destroy it!

• Some tools ignore missing values, others use some metric to fill in replacements

• The modeler should avoid default automated replacement techniques

• Difficult to know limitations, problems and introduced bias

• Replacing missing values without elsewhere capturing that information removes information from the dataset

23

How to Handle Missing Data?

• Ignore records (use only cases with all values)

• Usually done when class label is missing as most prediction methods do not handle missing data well

• Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes

• Ignore attributes with missing values

• Use only features (attributes) with all values (may leave out important features)

• Fill in the missing value manually

• tedious + infeasible?

24

How to Handle Missing Data?

• Use a global constant to fill in the missing value

• e.g., “unknown”. (May create a new class!)

• Use the attribute mean to fill in the missing value

• It will do the least harm to the mean of the existing data

• This is the right choice if the mean is to be unbiased

• But what if the standard deviation is to be unbiased? Filling in the mean shrinks the variance

• Use the attribute mean for all samples belonging to the same class to fill in the missing value
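The mean-based strategies above can be sketched in a few lines of pandas; the DataFrame df and the column names income and label are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric attribute with missing values and a class label.
df = pd.DataFrame({
    "income": [48_000, np.nan, 52_000, 61_000, np.nan, 58_000],
    "label":  ["low",  "low",  "low",  "high", "high", "high"],
})

# (1) Fill with the global attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# (2) Fill with the mean of the samples belonging to the same class.
class_means = df.groupby("label")["income"].transform("mean")
df["income_class_mean"] = df["income"].fillna(class_means)

print(df)
```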

25

How to Handle Missing Data?

• Use the most probable value to fill in the missing value

• Inference-based such as Bayesian formula or decision tree

• Identify relationships among variables

• Linear regression, multiple linear regression, nonlinear regression

• Nearest-neighbor estimator

• Find the k neighbors nearest to the point and fill in the most frequent value or the average value

• Finding neighbors in a large dataset may be slow
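For the nearest-neighbor estimator, one readily available option (a sketch, not the method prescribed by the slides) is scikit-learn's KNNImputer, which replaces each missing entry by averaging the attribute over the k nearest complete neighbors:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with a missing entry in the second column.
X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
    [7.0, 8.0],
])

# Replace each missing value by the average of that attribute
# over the k nearest neighbours (Euclidean distance on observed values).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```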

26

How to Handle Missing Data?

• Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available

• Bias is added when a wrong value is filled in

• No matter what techniques you use to conquer the problem, they come at a price. The more guessing you have to do, the further away from the real data the database becomes. This, in turn, can affect the accuracy and validity of the mining results.

27

Noisy Data

• Noise: random error or variance in a measured variable

• Incorrect attribute values may be due to

• faulty data collection instruments

• data entry problems

• data transmission problems

• technology limitation (ex: limited buffer size)

• inconsistency in naming convention

28

How to Handle Noisy Data?

• Binning method (smooth a value by consulting its neighborhood):

• first sort the data and partition it into a number of bins

• then smooth by bin means, by bin medians, or by bin boundaries

• Clustering

• detect and remove outliers

• (Outliers may be what we are looking for, as in fraud detection)

• Regression

• smooth by fitting the data into regression functions

• Combined computer and human inspection

• detect suspicious values (using measures that reflect the degree of surprise, or by comparing the predicted class label with the known label) and have them checked by a human

29

Binning for Data Smoothing

• Equal-width (distance) partitioning:

• It divides the range into N intervals of equal size: uniform grid

• If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B -A)/N.

• The most straightforward method

• Outliers may dominate presentation

• Skewed data is not handled well.

Disadvantages: (a) unsupervised; (b) where does N come from?; (c) sensitive to outliers

Advantages: (a) simple and easy to implement; (b) produces a reasonable abstraction of the data

30

Binning for Data Smoothing

• Equal-depth (frequency) partitioning:

• It divides the range into N intervals, each containing approximately the same number of samples

31

Binning for Data Smoothing

• Sorted data for price (in $): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into (equi-depth) bins:

• Bin 1: 4, 8, 9, 15

• Bin 2: 21, 21, 24, 25

• Bin 3: 26, 28, 29, 34

• Smoothing by bin means:

• Bin 1: 9, 9, 9, 9

• Bin 2: 23, 23, 23, 23

• Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries (each value replaced by the closest boundary value):

• Bin 1: 4, 4, 4, 15

• Bin 2: 21, 21, 25, 25

• Bin 3: 26, 26, 26, 34
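The example above can be reproduced with a small numpy sketch; note that the slide rounds the bin means (22.75 to 23 and 29.25 to 29).

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.split(prices, 3)  # equal-depth partitioning: 3 bins of 4 values each

for b in bins:
    # Smoothing by bin means.
    smoothed_mean = np.full(len(b), b.mean())
    # Smoothing by bin boundaries: each value is replaced by the closest of
    # the bin's minimum and maximum values.
    smoothed_bound = np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    print(b, "-> means:", smoothed_mean, "-> boundaries:", smoothed_bound)
```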

32

Cluster Analysis

33

Regression

34

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation

35

Data Integration

• Detecting and resolving data value conflicts

• For the same real world entity, attribute values from different sources may be different

• Which source is more reliable ?

• Is it possible to induce the correct value?

• Possible reasons: different representations, different scales, e.g., metric vs. British units

Data integration requires knowledge of the “business”

36

Solving Interschema Conflicts

• Classification conflicts

• Corresponding types describe different sets of real world elements. DB1: authors of journal and conference papers; DB2: authors of conference papers only.

• Generalization / specialization hierarchy

• Descriptive conflicts

• naming conflicts: synonyms, homonyms

• cardinalities: first name: one, two, N values

• domains: salary: $, Euro, ...; student grade: [0:20], [1:5]

• Structural conflicts

• DB1: Book is a class; DB2: books is an attribute of Author

• Choose the less constrained structure (Book is a class)

• Fragmentation conflicts

• DB1: class Road_segment; DB2: classes Way_segment, Separator

• Aggregation relationship

37

Handling Redundancy in Data Integration

• Redundant data often occurs when multiple databases are integrated

• The same attribute may have different names in different databases

• One attribute may be a “derived” attribute in another table, e.g., annual revenue

38

Handling Redundancy in Data Integration

• Redundant data may be detected by correlation analysis

The correlation coefficient between two attributes X and Y over N samples is

$$r_{XY} = \frac{\sum_{n=1}^{N}(x_n - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N}(x_n - \bar{x})^2}\;\sqrt{\sum_{n=1}^{N}(y_n - \bar{y})^2}}, \qquad -1 \le r_{XY} \le 1$$

where $\bar{x}$ and $\bar{y}$ are the means of X and Y.
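In practice the coefficient can be computed directly, for instance with numpy; the attribute names below (monthly_revenue, annual_revenue) are made up for illustration only.

```python
import numpy as np

# Hypothetical attributes: annual_revenue is roughly 12x monthly_revenue,
# so the two attributes carry largely redundant information.
monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue  = np.array([121.0, 142.0, 110.0, 178.0, 133.0])
other_attribute = np.array([3.0, 7.0, 1.0, 2.0, 9.0])

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(f"r = {r:.3f}")   # close to 1 -> one of the two attributes is a candidate for removal

# Full correlation matrix over all three attributes.
print(np.corrcoef([monthly_revenue, annual_revenue, other_attribute]).round(2))
```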

39

Scatter Matrix

40

Data Integration

• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

41

Data Transformation

• Data may have to be transformed to be suitable for a DM technique

• Smoothing: remove noise from data (binning, regression, clustering)

• Aggregation: summarization, data cube construction

• Generalization: concept hierarchy climbing

• Attribute/feature construction

• New attributes constructed from the given ones (e.g., add an attribute area computed from height and width)

• Normalization

• scale values to fall within a smaller, specified range

42

Normalization

• For distance-based methods, normalization helps prevent attributes with large ranges from outweighing attributes with small ranges

• min-max normalization

• z-score normalization

• normalization by decimal scaling

43

Normalization

• min-max normalization

$$v' = \frac{v - \min_v}{\max_v - \min_v}\,(new\_max - new\_min) + new\_min$$

• z-score normalization

$$v' = \frac{v - \bar{v}}{\sigma_v}$$

• normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$, where j is the smallest integer such that $\max(|v'|) < 1$

Example: range -986 to 917 ⇒ j = 3, so -986 → -0.986 and 917 → 0.917
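A small numpy sketch of the three normalizations, including the decimal-scaling example above:

```python
import numpy as np

v = np.array([-986.0, 200.0, 917.0])

# min-max normalization to [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_z = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_dec = v / 10 ** j
print(j, v_dec)   # j = 3, -986 -> -0.986 and 917 -> 0.917
```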

44

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation

45

Data Reduction Strategies

• A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set

• Data reduction

• Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

• Data reduction strategies

• Data cube aggregation

• Dimensionality reduction

• Numerosity reduction

• Discretization and concept hierarchy generation

46

Data Cube Aggregation

• Data can be aggregated so that the resulting data summarize, for example, sales per year instead of sales per quarter

• Reduced representation which contains all the relevant information if we are concerned with the analysis of yearly sales
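A minimal pandas sketch of this kind of roll-up aggregation; the column names are illustrative only.

```python
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 280, 402, 391, 617],
})

# Roll the data up from quarters to years: the reduced representation still
# contains everything needed for an analysis of yearly sales.
yearly = quarterly.groupby("year", as_index=False)["sales"].sum()
print(yearly)
```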

47

Data Cube Aggregation

• The lowest level of a data cube

• the aggregated data for an individual entity of interest

• e.g., a customer in a phone calling data warehouse.

• Reference appropriate levels

• Use the smallest representation which is enough to solve the task

48

Data Cube Aggregation

49

Dimensionality Reduction

• Reduces the data set size by removing attributes which may be irrelevant to the mining task

• ex. is the telephone number relevant to determine if a customer is likely to buy a given CD?

• Although it is possible for the analyst to identify some irrelevant or useful attributes, this can be a difficult and time-consuming task, hence the need for attribute subset selection methods

50

Dimensionality Reduction

• Feature selection (i.e., attribute subset selection):

• Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features

• Reduce number of patterns and reduce the number of attributes appearing in the patterns

• Patterns are easier to understand

51

Heuristic Feature Selection Methods

• There are 2^d possible feature subsets of d features

• Heuristic feature selection methods:

• Best single features under the feature independence assumption:

• choose by significance tests or information gain measures

• a feature is interesting if it reduces uncertainty

(Figure: splitting a 40/60 class distribution. A split into children with 28/42 and 12/18 keeps the 40:60 class proportions in both children and gives no improvement; a split into 40/0 and 0/60 is a perfect split.)

52

Heuristic Feature Selection Methods

• Heuristic methods

• step-wise forward selection

• step-wise backward elimination

• combining forward selection and backward elimination

• decision-tree induction

53

Entropy

• Two classes (p, 1-p): Ent = 0.72 for (0.2, 0.8); 0.97 for (0.4, 0.6); 1 for (0.5, 0.5); 0.97 for (0.6, 0.4); 0.72 for (0.8, 0.2); the maximum is log2(2) = 1

• Three classes (p1, p2, p3): Ent = 0.92 for (0.1, 0.1, 0.8); 1.37 for (0.2, 0.2, 0.6); 1.37 for (0.1, 0.45, 0.45); 1.52 for (0.2, 0.4, 0.4); 1.57 for (0.3, 0.3, 0.4); 1.58 for (0.33, 0.33, 0.33); the maximum is log2(3) ≈ 1.58

54

Entropy/Impurity

• S - training set, C1,...,CN classes

• Entropy E(S) - measure of the impurity in a group of examples

• pc - proportion of Cc in S

$$\mathrm{Impurity}(S) = -\sum_{c=1}^{N} p_c \cdot \log_2 p_c$$

55

Impurity

(Figure: three groups of examples, ranging from a very impure group to a less impure group to one with minimum impurity.)

56

Information Gain

• Information Gain = Impurity (parent) – [Impurity (children)]

• Using attribute A, a set S is split into sets {S1, S2}

• S1 contains p1 examples of P and n1 examples of N

• S2 contains p2 examples of P and n2 examples of N

• Most informative attribute: max Information Gain

$$\mathrm{Impurity}(children) = \frac{p_1 + n_1}{p + n}\,\mathrm{Impurity}(S_1) + \frac{p_2 + n_2}{p + n}\,\mathrm{Impurity}(S_2)$$
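A small numpy sketch of these formulas, checked against the 40/60 example shown earlier: the split into 28/42 and 12/18 preserves the class proportions and gives no gain, while the perfect split recovers the full parent impurity.

```python
import numpy as np

def impurity(counts):
    """Entropy of a group of examples, computed from its class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """Impurity(parent) minus the weighted Impurity of the children."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted

parent = [40, 60]
print(information_gain(parent, [[28, 42], [12, 18]]))  # ~0.0  (no improvement)
print(information_gain(parent, [[40, 0], [0, 60]]))    # ~0.97 (perfect split)
```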

57

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Figure: the induced tree tests A4 at the root and A1 and A6 below it; its leaves are labelled Class 1 and Class 2.)

Reduced attribute set: {A1, A4, A6}

58

Principal Component Analysis

• When we collect multivariate data it is not uncommon to discover that at least some of the variables are correlated with each other.

• Therefore, there will be some redundancy in the information provided by the variables.

• In the extreme case of two perfectly correlated variables (x & y) one of them is redundant since if we know the value of the x variable the value of the y variable has no freedom and vice versa.

• Principal Components Analysis exploits the redundancy in multivariate data, enabling us to:

• reduce the dimensionality of our data set without a significant loss of information.

59

Principal Component Analysis

• Given N data vectors from k-dimensions, find c <= k orthogonal vectors that can be best used to represent data

• The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)

• Each data vector is a linear combination of the c principal component vectors

• Works for numeric data only

• Used when the number of dimensions is large
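A minimal scikit-learn sketch of reducing N k-dimensional vectors to c principal components, here on synthetic correlated data with k = 5 and c = 2:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 vectors in k = 5 dimensions, where the attributes are
# correlated (all are noisy combinations of 2 underlying factors).
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)          # keep c = 2 orthogonal components
X_reduced = pca.fit_transform(X)   # 200 x 2 instead of 200 x 5

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # most of the variance survives the reduction
```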

60

Principal Component Analysis

(Figure: with perfect correlation, one axis holds all the information; the information along the second component u2 could be discarded without a big loss.)

61

Numerosity Reduction

• Parametric methods

• Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)

• Non-parametric methods

• Do not assume models

• Major families: histograms, clustering, sampling

62

Regression Analysis

• Linear regression: Y = α + β X

• Data are modeled to fit a straight line

• Two parameters, α and β, specify the line and are estimated from the data at hand

• They are fitted using the least squares criterion on the known values Y1, Y2, ..., and X1, X2, ...

• Multiple regression: Y = b0 + b1 X1 + b2 X2

• Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

• Many nonlinear functions can be transformed into the above.
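A numpy sketch of both cases on synthetic data: fitting α and β by least squares for a single predictor, and b0, b1, b2 for two predictors; only the fitted parameters would need to be stored.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear regression Y = alpha + beta * X
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
beta, alpha = np.polyfit(x, y, deg=1)          # least-squares fit of a straight line
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")

# Multiple regression Y = b0 + b1*X1 + b2*X2
X = np.column_stack([np.ones(50), x, rng.uniform(0, 5, size=50)])
y2 = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.3, size=50)
b, *_ = np.linalg.lstsq(X, y2, rcond=None)
print(b)   # roughly [1.0, 0.5, -2.0]; only these parameters need to be stored
```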

63

Histograms

• A popular data reduction technique

• Divide the data into buckets and store the average (or sum) for each bucket

• Can be constructed optimally in one dimension using dynamic programming:

• Optimal if it has minimum variance; the histogram variance is a weighted sum of the variances of the source values in each bucket

(Figure: equal-width histogram of price values between 10,000 and 90,000.)
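A numpy sketch of an equal-width histogram used as a reduced representation: only the bucket edges and a per-bucket count and average are kept (synthetic price data; this is the simple equal-width variant, not the V-optimal construction mentioned above).

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.uniform(10_000, 90_000, size=10_000)   # synthetic price values

counts, edges = np.histogram(prices, bins=8, range=(10_000, 90_000))
# Per-bucket average: sum of the values in each bucket divided by its count.
sums, _ = np.histogram(prices, bins=edges, weights=prices)
averages = sums / np.maximum(counts, 1)

# The reduced representation: 8 (edge, count, average) triples instead of 10,000 values.
for left, right, n, avg in zip(edges[:-1], edges[1:], counts, averages):
    print(f"[{left:,.0f}, {right:,.0f}): n={n}, avg={avg:,.0f}")
```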

64

Clustering

• Partitioning a data set into clusters makes it possible to store only the cluster representations

• Can be very effective if data is clustered but not if data is “smeared”

• There are many choices of clustering definitions and clustering algorithms, further detailed in next lessons

65

Sampling

• The cost of sampling is proportional to the sample size and not to the original dataset size; therefore, a mining algorithm's complexity is potentially sub-linear in the size of the data

• Choose a representative subset of the data

• Simple random sampling (with or without replacement)

• Stratified sampling:

• Approximate the percentage of each class (or subpopulation of interest) in the overall database

• Used in conjunction with skewed data
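A pandas sketch of simple random sampling with and without replacement, and of stratified sampling that preserves the class proportions; df, label and amount are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["fraud"] * 20 + ["normal"] * 980,   # skewed class distribution
    "amount": range(1000),
})

# Simple random sampling, without and with replacement (SRSWOR / SRSWR).
srswor = df.sample(frac=0.1, replace=False, random_state=0)
srswr  = df.sample(frac=0.1, replace=True,  random_state=0)

# Stratified sampling: sample the same fraction inside each class, so the
# skewed class proportions of the raw data are (approximately) preserved.
stratified = df.groupby("label", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(df["label"].value_counts(normalize=True))
print(stratified["label"].value_counts(normalize=True))
```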

66

Sampling

(Figure: raw data sampled by SRSWOR, simple random sampling without replacement, and by SRSWR, simple random sampling with replacement.)

• With replacement (SRSWR): every draw has the same distribution, p(Y1) = p(Y2) = p(Y), and the draws are independent: p(Y1, Y2) = p(Y1) · p(Y2)

• Without replacement (SRSWOR): each draw still has p(Y1) = p(Y2) = p(Y), but the draws are not independent: p(Y1, Y2) ≠ p(Y1) · p(Y2)

67

Sampling

(Figure: raw data and a cluster/stratified sample drawn from it.)

The target population may consist of a series of separate 'sub-populations'. If we ignore this possibility, the population estimates we derive will be a sort of 'average of the averages' for the sub-populations, and may therefore be meaningless.

Usually the strata are of equal sizes but we may also decide to use strata whose relative sizes reflect the estimated proportions of the sub-populations within the whole population. Within each stratum we select a sample, ensuring that the probability of selection is the same for each unit in each sub-population.

The process of splitting the sample to take account of possible sub-populations is called stratification.

68

Increasing Dimensionality

• In some circumstances the dimensionality of a variable needs to be increased:

• Color from a category list to the RGB values

• ZIP codes from a category list to latitude and longitude

69

Data Preprocessing

• Why preprocess the data?

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation

70

Discretization

• Discretization:

• Divide the range of a continuous attribute into intervals

• Some classification algorithms only accept categorical attributes.

• Reduce data size by discretization

• Prepare for further analysis

71

Discretization and Concept hierarchy

• Discretization

• Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

• Concept hierarchies

• Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

• A recursive application of discretization methods can provide a hierarchical partitioning of the attribute values.

72

Discretization and concept hierarchy generation for numeric data

• Binning (see sections before)

• Histogram analysis (see sections before)

• Clustering analysis (see sections before)

• Entropy-based discretization

73

Entropy Based Discretization (1)

1. Sort examples in increasing order

2. Each value forms an interval ('m' intervals)

3. Calculate the entropy measure of this discretization

4. Merge the pair of adjacent intervals whose merging would improve the entropy

5. Repeat 3-4 with new discretization intervals

74

Entropy Based Discretization (2)

1. Sort examples in increasing order

2. Each value forms an interval ('m' intervals)

3. Calculate the entropy measure of this discretization

4. Find the binary split boundary that minimizes the entropy function over all possible boundaries. The split is selected as a binary discretization.

5. Apply the process recursively until some stopping criterion is met, e.g.,

$$E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$

$$\mathrm{Ent}(S) - E(S,T) > \delta$$

75

Example

• Sorted values (15): 11, 11, 11, 14, 15, 15, 15, 16, 16, 18, 21, 21, 21, 22, 23; Ent(S) = 2.82

• Weighted entropies E(S,T) of the candidate binary splits: after 11: 2.10; after 14: 1.99; after 15: 1.83; after 16: 1.85; remaining splits: 1.90, 2.26, 2.47

• Best split: after 15, with Ent(S) - E(S,T) = 2.82 - 1.83 = 0.99
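A numpy sketch that reproduces the search above. Following the slide's example, the entropy of an interval is computed over the frequencies of the distinct values it contains; with labelled data one would use the class proportions instead.

```python
import numpy as np

def entropy(values):
    """Entropy of the distribution of distinct values within an interval."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

data = np.array([11, 11, 11, 14, 15, 15, 15, 16, 16, 18, 21, 21, 21, 22, 23])
print(f"Ent(S) = {entropy(data):.2f}")                # 2.82

best = None
for i in range(1, len(data)):
    if data[i] == data[i - 1]:
        continue                   # only consider boundaries between different values
    left, right = data[:i], data[i:]
    e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    print(f"split before {data[i]}: E(S,T) = {e:.2f}")
    if best is None or e < best[0]:
        best = (e, data[i])

# Best split: before 16, E(S,T) = 1.83, i.e. a gain of about 2.82 - 1.83 = 0.99 bits.
print(f"best binary split: before {best[1]} (E(S,T) = {best[0]:.2f})")
```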

76

Concept Hierarchies

• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

• E.g. City values for location include Los Angeles or New York

• Each city can be mapped to the state or province where it belongs

• A state or province can, in turn, be mapped to the country to which it belongs

• These mappings form a concept hierarchy

77

A Concept Hierarchy: Dimension (location)

(Figure: location hierarchy with levels all → region (Europe, North America) → country (Germany, Spain, Canada, Mexico) → city (Frankfurt, Vancouver, Toronto, ...) → office (L. Chan, M. Wind, ...).)

78

Concept Hierarchy: examples

• Year, quarter, month, day

• Store Country, Store State, Store City, Store Name

• Product name, Brand name, product subcategory (Seafood, Baking goods), product family (drink, food, ...)

79

Specification of a set of attributes

Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

• country: 15 distinct values

• province_or_state: 65 distinct values

• city: 3,567 distinct values

• street: 674,339 distinct values

Generated hierarchy (top to bottom): country → province_or_state → city → street
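A pandas sketch of this heuristic with illustrative column names: attributes are simply ordered by their number of distinct values, the fewest at the top of the hierarchy.

```python
import pandas as pd

# Hypothetical location data.
df = pd.DataFrame({
    "street":  ["Rua A", "Rua B", "Main St", "Elm St", "Rua C"],
    "city":    ["Porto", "Porto", "Boston", "Boston", "Lisboa"],
    "country": ["Portugal", "Portugal", "USA", "USA", "Portugal"],
})

# Fewer distinct values -> higher (more general) level of the hierarchy.
order = df.nunique().sort_values()
print(order)                       # country < city < street
print(" -> ".join(order.index))    # from the top of the hierarchy down
```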

Counter example: 20 distinct years, 12 months, 7 days of the week; here the attribute with the most distinct values (year) is actually the most general level, so the heuristic would order the hierarchy incorrectly.

80

References

• ‘Data preparation for data mining’, Dorian Pyle, 1999

• ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000

• ‘Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations’, Ian H. Witten and Eibe Frank, 1999

81

Thank you!!!