BC 2014 Session2
Transcript of BC 2014 Session2
Wastage to the US economy of $3 trillion.
15-20% of the IT operating budget of most corporates.
Implicated in the Challenger shuttle and Iranian passenger jet shootdown incidents.
Clue: 50-80% of a data mining project's time may be spent doing this.
Whodunit? Again?
Data Cleaning
Why Data Cleaning?
Data in the real world is often dirty:
duplicate records, undesirable characters, out-of-range values
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: data containing random errors
outliers: data points deviating from the expected range
inconsistent: containing discrepancies in codes or names
Dirty Data → Useless Analysis!
Goal of Data Cleaning
The goal of the data cleaning process is to preserve meaningful data while removing elements that impede effective analysis or degrade the quality of the results.
Data cleaning doesn't come easy
Cleaning data and preparing it for analysis is one of those thankless jobs that is dull, laborious, and painstaking!! WHY?
lack of an easy, sensible, and common process
no means of automation -- until now?
people rarely document their datasets effectively
it's easy to lose your (decimal) place if you get distracted or have to correct a mistake you made several changes earlier
Data Cleaning Tasks
Eliminate duplicate records, undesirable characters within cells, out-of-range values
Fill missing values
Identify and remove outliers
Smooth out noisy data
Correct inconsistent data
Data Cleaning using Excel
Identify and remove duplicate records
Data formatting
Remove undesirable characters
Locate out-of-range values
Parse data
Combine data elements that are stored across multiple columns into one column.
Translate Values
Handling duplicate records
Use the Remove Duplicates option on the Data tab for exact duplicates
OR
Identify a key field/attribute and sort
Use the function EXACT(R2,R3) to compare adjacent rows
You may use conditional formatting to colour the TRUE values for identification
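Outside Excel, the same two approaches can be sketched in Python with pandas; the records below are invented for illustration.

```python
import pandas as pd

# Hypothetical records containing one exact duplicate
df = pd.DataFrame({
    "name":  ["Anu", "Ravi", "Anu", "Meera"],
    "phone": ["04952809254", "04712334455", "04952809254", "04842112233"],
})

# Like Remove Duplicates: drop rows that repeat an earlier row exactly
deduped = df.drop_duplicates()

# Like sorting on a key field and comparing adjacent cells with EXACT():
df = df.sort_values("phone").reset_index(drop=True)
df["dup_of_prev"] = df["phone"].eq(df["phone"].shift())
print(deduped)
print(df)
```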
Formatting Textual Data
UPPER(), LOWER() and PROPER() functions can be used
TRIM(): removes all spaces except single spaces between words
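For comparison, rough pandas equivalents of these functions (the names series is made up):

```python
import pandas as pd

names = pd.Series(["  alice JOHNSON ", "BOB   smith"])

print(names.str.upper())   # like UPPER()
print(names.str.lower())   # like LOWER()
print(names.str.title())   # roughly like PROPER()
# like TRIM(): strip the ends and collapse runs of internal spaces
print(names.str.strip().str.replace(r" +", " ", regex=True))
```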
Formatting Numbers, Dates etc.
Formatting
Number: appropriate decimal points
Currency
Date formatting
Percentage etc.
Remove undesirable characters
Data may have undesirable characters that are useful for visually displaying the information but useless in statistical analyses
Consider the phone number column, for example. We want to remove the unwanted periods, dashes, and parentheses so that every phone number contains numbers only.
0495-2809-254
Use Excel Find/Replace functionality
You may use the macro functionality to automate the process
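The whole Find/Replace pass can also be expressed as a single regular expression in Python; the second phone number below is invented for illustration:

```python
import re

phones = ["0495-2809-254", "(0495) 280.9254"]

# Remove everything that is not a digit, in one pass
cleaned = [re.sub(r"\D", "", p) for p in phones]
print(cleaned)  # ['04952809254', '04952809254']
```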
Locate out-of-range values
Ensuring that your data are in the correct range is critical
The Data Validation tool may be used (it doesn't exactly clean an Excel spreadsheet)
Parse data
Text to Columns splits cell data at delimiters that you specify, such as commas, tabs, or spaces, and puts the output into separate cells.
Consider, for example, City, State, ZIP data.
We can parse this into separate columns by running the Text to Columns utility and selecting the comma as the delimiter to separate the city name from the state and ZIP code.
LEFT(), RIGHT(), and MID() functions may also be used
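A sketch of the same parsing in pandas; the addresses and the slice widths are assumptions, not part of the slides:

```python
import pandas as pd

addr = pd.Series(["Kozhikode, KL 673001", "Kochi, KL 682001"])

# Like Text to Columns with a comma delimiter
parts = addr.str.split(", ", expand=True)
parts.columns = ["city", "state_zip"]

# LEFT()/RIGHT()-style slicing on the remaining field
parts["state"] = parts["state_zip"].str[:2]    # like LEFT(cell, 2)
parts["zip"]   = parts["state_zip"].str[-6:]   # like RIGHT(cell, 6)
print(parts)
```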
Combining Data Elements Across Multiple Columns
Use the =CONCATENATE() formula.
Use Copy | Paste Special to copy the newly calculated values to another blank column.
Use Find | Replace to change the merged values with valid values, if needed.
Translating Values : VLOOKUP Function
Searches for a value in the leftmost column of a table and returns a value from a different column, from the same row as the value found
=VLOOKUP(lookup_value,table_array,col_index_num,[range_lookup])
lookup_value: the value to look up in the first column of the table
table_array: the lookup table or range of information
col_index_num: the column number in the table that contains the value to be returned
range_lookup: FALSE for an exact match; TRUE (approximate match) is the default
Brackets [ ] indicate an optional argument
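An exact-match VLOOKUP (range_lookup = FALSE) behaves like a left join; a minimal pandas sketch with invented codes:

```python
import pandas as pd

# Lookup table: code in the leftmost column, value to return beside it
lookup = pd.DataFrame({"code": ["KL", "TN"], "state": ["Kerala", "Tamil Nadu"]})
data   = pd.DataFrame({"customer": ["A", "B", "C"], "code": ["KL", "TN", "KL"]})

# col_index_num is replaced by naming the column to bring across
merged = data.merge(lookup, on="code", how="left")
print(merged)
```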
Handling Missing Data
Data is not always available
E.g. Many tuples have no recorded value for several attributes, such as customer income in sales data
Missing values and default values could be indistinguishable
Understanding pattern of missing data may unearth data integrity issues
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding or lack of importance
Missing data may need to be inferred from the existing data.
Missing Data Example
How should we handle this?
ID  Age  Sex  Edu.  Race  A    B  C  D
1   70   F    16    W     3.8  4  8  5
2   70   F    16    W     0.6  1  1  1
3   60   M    12    B     1.1  2  3  1
4   85   F    21    W     1.3  2  3  1
5   70   M    21    W     ?    2  4  3
Detecting Missing Data
Explicitly missing data
Match data specifications against data: are all the attributes present?
Scan individual records: are there gaps?
Rough checks: number of records, number of duplicates
Hidden damage to data
Missing values and defaults are indistinguishable: too many missing values? Metadata or domain expertise can help
Errors of omission: e.g. all call logs from a particular area are missing. Check whether data is missing randomly or intentionally
Sanity checks that a statistician usually performs, often by hand.
How to deal with missing data
Do nothing
Exclude subjects with missing values
Ignore the tuple
Expand the results from the sub-sample to the whole sample
Make a guess, replace with the guessed values
Fill in with simple imputation
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., unknown, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
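The constant, attribute-mean, and class-mean fills can be sketched in pandas (a tiny invented table):

```python
import pandas as pd

df = pd.DataFrame({"cls": ["x", "x", "y", "y"],
                   "income": [10.0, None, 30.0, None]})

df["fill_const"] = df["income"].fillna(-1)                  # global constant ("unknown")
df["fill_mean"]  = df["income"].fillna(df["income"].mean()) # attribute mean: 20
# attribute mean within the same class: 10 for x, 30 for y (smarter)
df["fill_cls"]   = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))
print(df)
```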
Fill in with better guessed values
Sophisticated Imputation
Propensity Score
Regression methods
predict the missing value from a regression equation with other variables as predictors
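As a minimal sketch (not from the slides), fit A on B using the four patients in the example table whose A is observed, then predict A for patient #5:

```python
import numpy as np

# Columns B (predictor) and A (target) for the patients with A observed
B = np.array([4.0, 1.0, 2.0, 2.0])
A = np.array([3.8, 0.6, 1.1, 1.3])

slope, intercept = np.polyfit(B, A, 1)    # least-squares fit A ~ slope*B + intercept
print(round(slope * 2.0 + intercept, 2))  # predicted A for patient #5 (B = 2)
```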
Missing Mechanism
Missing Completely At Random (MCAR)
The best scenario
Simple approaches can yield unbiased results
Missing At Random (MAR)
The less ideal scenario
More advanced approaches are necessary; can yield unbiased results
Not Missing At Random (NMAR)
The worst scenario
No approach can help; the results will always be biased
Missing Mechanism - cont.
How do we determine the missing mechanism?
Since missing information is not observed, we don't really know what the complete sample looks like, and thus we can't say for sure whether the missing data are MCAR, MAR, or NMAR
Can we guess the mechanism behind these missing values? Examples:
Gender missing because the responder forgets to fill in the information
Previous order number missing because some people don't like to answer some of the questions
Income missing because the patient's income is extremely high and he/she doesn't want to reveal it
Missing Mechanism - cont.
The probability of MCAR is often low.
MAR or NMAR? Most of the time we can't really determine which.
A common approach: unless a missing value is clearly NMAR (e.g. income), we assume MAR for the missing data
Major Challenges:
How to identify similar slices from which we can impute the missing value
How to impute the missing values
Simple Imputation
Standalone imputation
Mean, median, other point estimates
What is the assumption you make here?
That the distribution of the missing values is the same as that of the present values
Does not take into account inter-relationships
Convenient, easy to implement
Missing at random.
Imputation example
How to impute the missing A value for patient # 5?
Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
By age: (3.8+0.6)/2 = 2.2
By sex: 1.1
By education: 1.3
By race: (3.8 + 0.6 + 1.3)/3 = 1.9
By B: (1.1 + 1.3)/2 = 1.2
Who is/are in the same slice as #5?
ID  Age  Sex  Edu.  Race  A    B  C  D
1   70   F    16    W     3.8  4  8  5
2   70   F    16    W     0.6  1  1  1
3   60   M    12    B     1.1  2  3  1
4   85   F    21    W     1.3  2  3  1
5   70   M    21    W     ?    2  4  3
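The slice means above can be reproduced with pandas (the table is transcribed from the slide):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5], "Age": [70, 70, 60, 85, 70],
    "Sex": ["F", "F", "M", "F", "M"], "Edu": [16, 16, 12, 21, 21],
    "Race": ["W", "W", "B", "W", "W"],
    "A": [3.8, 0.6, 1.1, 1.3, None], "B": [4, 1, 2, 2, 2],
})

print(df["A"].mean())                        # sample mean: 1.7
print(df.loc[df["Age"] == 70, "A"].mean())   # by age: 2.2
print(df.loc[df["Sex"] == "M", "A"].mean())  # by sex: 1.1
print(df.loc[df["Race"] == "W", "A"].mean()) # by race: 1.9
print(df.loc[df["B"] == 2, "A"].mean())      # by B: 1.2
```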
Noisy Data
Noise:
random error in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
How to Handle Noisy Data?
Combined computer and human inspection
detect suspicious values, then have a human check and correct them
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Detecting Noisy Data
Compare estimates (averages, frequencies, medians) with expected values and bounds; check at various levels of granularity. Aggregates can be misleading.
Check for spikes and dips in distributions and histograms
Too many default values? Metadata or domain expertise can help
Sanity checks that a statistician usually performs, often by hand.
Simple Methods: Binning
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N.
The most straightforward
But outliers may dominate
Skewed data is not handled well
Equal-depth (frequency) partitioning:
It divides the range into N intervals, each containing approximately the same number of samples
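pandas offers both partitioning schemes directly; a sketch using the price data from the next slide:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10
print(pd.cut(prices, bins=3).value_counts())

# Equal-depth: 3 intervals with roughly the same number of samples (4 each)
print(pd.qcut(prices, q=3).value_counts())
```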
Binning Methods for Data Smoothing
* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2 (15-24): 15, 21, 21, 24
- Bin 3 (25-34): 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing (continued)
* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
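A short sketch of smoothing by bin means for the equal-depth case, reproducing the 9/23/29 values above (rounded to whole numbers as on the slide):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth bins of four values each, then replace each value
# by the mean of its bin
bins = pd.qcut(prices, q=3)
smoothed = prices.groupby(bins, observed=True).transform("mean").round()
print(smoothed.tolist())
# [9.0, 9.0, 9.0, 9.0, 23.0, 23.0, 23.0, 23.0, 29.0, 29.0, 29.0, 29.0]
```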
Outliers
Outlier: a departure from the expected behaviour
Data points inconsistent with the majority of data
Defining the expected is crucial in outlier detection
Outlier : Suspicious Data
Consider the data points
3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
92 is suspicious - an outlier
Often they are data glitches
OR
They could be a data miner's dream:
E.g. highly profitable customers
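One common way to make "suspicious" precise (a standard robust rule, not from the slides) is a z-score built on the median; 92 is the only point flagged:

```python
import numpy as np

x = np.array([3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92])

med = np.median(x)
mad = np.median(np.abs(x - med))      # median absolute deviation
robust_z = 0.6745 * (x - med) / mad   # comparable to a standard z-score
print(x[np.abs(robust_z) > 3.5])      # [92]
```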
Outlier Detection Methods
Model Based Approaches
Deviation-based Approaches
Model Fitting Based Approach
Models summarize general trends in data
more complex than simple aggregates
Eg. Regression
Data points that do not conform to well-fitting models are potential outliers
Do two things.
Assume a model to find outliers.
GOF (goodness of fit): but does the model really fit?
Deviation-based Approaches
General idea
Given a set of data points (local group or global set)
Outliers are points that do not fit the general characteristics of that set, i.e., the variance of the set is minimized when the outliers are removed
Basic assumption
Outliers are the outermost points of the data set
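A minimal sketch of the variance-minimization idea, reusing the suspicious-data example: flag the point whose removal reduces the variance of the set the most.

```python
import numpy as np

x = np.array([3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92])

# Variance of the set after dropping each point in turn
var_without = [np.delete(x, i).var() for i in range(len(x))]
print(x[int(np.argmin(var_without))])  # 92: dropping it minimizes the variance
```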
Censoring and Truncation
Censored: measurement is bounded but not precise, e.g. call durations > 200 are recorded as 200
Truncated: data point is dropped if it exceeds or falls below a certain bound, e.g. customers with less than 2 minutes of calling per month
Summary (so far)
Data Cleaning is a crucial pre-processing step for Data Analysis
Issues involved may range from simple formatting to outliers
Multiple approaches available for each problem
Select the one best matching the problem detected
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Calicut = Kozhikode
Detecting and resolving data value conflicts (Inconsistencies)
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g. miles vs. kilometres
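A toy sketch of schema integration and scale-conflict resolution in pandas; the key names and values are invented:

```python
import pandas as pd

# Source A records distance in miles; source B uses another key name and km
a = pd.DataFrame({"cust_id": [1, 2], "distance_miles": [10.0, 25.0]})
b = pd.DataFrame({"cust_no": [1, 2], "distance_km": [16.1, 40.2]})

b = b.rename(columns={"cust_no": "cust_id"})               # schema integration
b["distance_miles"] = (b["distance_km"] / 1.609).round(1)  # resolve scale conflict
merged = a.merge(b[["cust_id", "distance_miles"]],
                 on="cust_id", suffixes=("_a", "_b"))
print(merged)  # the two distance columns should now agree
```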
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different names in different databases
Derivable data: One attribute may be a derived attribute in another table, e.g., annual revenue
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies
Data Transformation
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]

$v' = \dfrac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$

E.g. let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$\dfrac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716$

Data Transformation: Normalization
Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):

$v' = \dfrac{v - \mu}{\sigma}$

E.g. let $\mu = 54{,}000$ and $\sigma = 16{,}000$. Then 73,600 becomes $\dfrac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
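Both normalizations in a few lines of pandas, reproducing the 0.716 and 1.225 results (the series is assumed to contain the range endpoints and 73,600):

```python
import pandas as pd

income = pd.Series([12_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax.round(3).tolist())   # [0.0, 0.716, 1.0]

# Z-score normalization with mu = 54,000 and sigma = 16,000 from the slide
z = (income - 54_000) / 16_000
print(z.round(3).tolist())        # [-2.625, 1.225, 2.75]
```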
Thank You