BC 2014 Session2
Transcript of BC 2014 Session2
Wastage to the US economy of $3 trillion.
15-20% of the IT operating budget of most corporates.
Implicated in the Challenger shuttle and Iranian passenger jet shootdown incidents.
Clue: 50-80% of a data mining project's time may be spent doing this.
Whodunit? Again?
Data Cleaning
Why Data Cleaning?
Data in the real world is often dirty:
duplicate records, undesirable characters, out-of-range values
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: data containing random errors
outliers: data points deviating from the expected range
inconsistent: containing discrepancies in codes or names
Dirty Data → Useless Analysis!
Goal of Data Cleaning
The goal of the data cleaning process is to preserve meaningful data while removing elements that impede effective analysis or degrade the quality of the results.
Data cleaning doesn't come easy
Cleaning data and preparing it for analysis is one of those thankless jobs that is dull, laborious, and painstaking!! WHY?
lack of an easy, sensible, and common process
no means of automation -- until now?
people rarely document their datasets effectively
it's easy to lose your (decimal) place if you get distracted or have to correct a mistake you made several changes earlier
Data Cleaning Tasks
Eliminate duplicate records, undesirable characters within cells, out-of-range values
Fill missing values
Identify and remove outliers
Smooth out noisy data
Correct inconsistent data
Data Cleaning using Excel
Identify and remove duplicate records
Data formatting
Remove undesirable characters
Locate out-of-range values
Parse data
Combine data elements that are stored across multiple columns into one column.
Translate Values
Handling duplicate records
Use the Remove Duplicates option on the Data tab for exact duplicates
OR
Identify a key field/attribute and sort
Use the function EXACT(R2,R3) to compare adjacent rows
You may use conditional formatting to colour the TRUE values for identification
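Outside Excel, the same two approaches can be sketched in Python with pandas; the records below are invented for illustration.

```python
import pandas as pd

# Hypothetical records containing one exact duplicate
df = pd.DataFrame({
    "name":  ["Anu", "Ravi", "Anu", "Meera"],
    "phone": ["04952809254", "04712334455", "04952809254", "04842112233"],
})

# Like Remove Duplicates: drop rows that repeat an earlier row exactly
deduped = df.drop_duplicates()

# Like sorting on a key field and comparing adjacent cells with EXACT():
df = df.sort_values("phone").reset_index(drop=True)
df["dup_of_prev"] = df["phone"].eq(df["phone"].shift())
print(deduped)
print(df)
```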
Formatting Textual Data
UPPER(), LOWER() and PROPER() functions can be used
TRIM(): removes all spaces except single spaces between words
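For comparison, rough pandas equivalents of these functions (the names series is made up):

```python
import pandas as pd

names = pd.Series(["  alice JOHNSON ", "BOB   smith"])

print(names.str.upper())   # like UPPER()
print(names.str.lower())   # like LOWER()
print(names.str.title())   # roughly like PROPER()
# like TRIM(): strip the ends and collapse runs of internal spaces
print(names.str.strip().str.replace(r" +", " ", regex=True))
```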
Formatting Numbers, Dates etc.
Formatting
Number: appropriate decimal points
Currency
Date formatting
Percentage etc.
Remove undesirable characters
Data may have undesirable characters that are useful for visually displaying the information but useless in statistical analyses
Consider the phone number column, for example. We want to remove the unwanted periods, dashes, and parentheses so that every phone number contains numbers only.
0495-2809-254
Use Excel Find/Replace functionality
You may use the macro functionality to automate the process
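The whole Find/Replace pass can also be expressed as a single regular expression in Python; the second phone number below is invented for illustration:

```python
import re

phones = ["0495-2809-254", "(0495) 280.9254"]

# Remove everything that is not a digit, in one pass
cleaned = [re.sub(r"\D", "", p) for p in phones]
print(cleaned)  # ['04952809254', '04952809254']
```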
Locate out-of-range values
Ensuring that your data are in the correct range is critical
The Data Validation tool may be used (it doesn't exactly clean an Excel spreadsheet)
Parse data
Text to Columns splits cell data at delimiters that you specify, such as commas, tabs, or spaces, and puts the output into separate cells.
Consider, for example, City, State, ZIP data.
We can parse this into separate columns by running the Text to Columns utility and selecting the comma as the delimiter to separate the city name from the state and ZIP code.
LEFT(), RIGHT(), and MID() functions may also be used
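A sketch of the same parsing in pandas; the addresses and the slice widths are assumptions, not part of the slides:

```python
import pandas as pd

addr = pd.Series(["Kozhikode, KL 673001", "Kochi, KL 682001"])

# Like Text to Columns with a comma delimiter
parts = addr.str.split(", ", expand=True)
parts.columns = ["city", "state_zip"]

# LEFT()/RIGHT()-style slicing on the remaining field
parts["state"] = parts["state_zip"].str[:2]    # like LEFT(cell, 2)
parts["zip"]   = parts["state_zip"].str[-6:]   # like RIGHT(cell, 6)
print(parts)
```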
Combining Data Elements Across Multiple Columns
Use the =CONCATENATE() formula.
Use Copy | Paste Special to copy the newly calculated values to another blank column.
Use Find | Replace to change the merged values with valid values, if needed.
Translating Values : VLOOKUP Function
Searches for a value in the leftmost column of a table and returns a value from a different column, from the same row as the value found
=VLOOKUP(lookup_value,table_array,col_index_num,[range_lookup])
lookup_value: the value to look up in the first column of the table
table_array: the lookup table or range of information
col_index_num: the column number in the table that contains the value to be returned
range_lookup: FALSE for an exact match; TRUE (approximate match) is the default
Brackets [ ] indicate an optional argument
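An exact-match VLOOKUP (range_lookup = FALSE) behaves like a left join; a minimal pandas sketch with invented codes:

```python
import pandas as pd

# Lookup table: code in the leftmost column, value to return beside it
lookup = pd.DataFrame({"code": ["KL", "TN"], "state": ["Kerala", "Tamil Nadu"]})
data   = pd.DataFrame({"customer": ["A", "B", "C"], "code": ["KL", "TN", "KL"]})

# col_index_num is replaced by naming the column to bring across
merged = data.merge(lookup, on="code", how="left")
print(merged)
```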
Handling Missing Data
Data is not always available
E.g. Many tuples have no recorded value for several attributes, such as customer income in sales data
Missing values and default values could be indistinguishable
Understanding pattern of missing data may unearth data integrity issues
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding or lack of importance
Missing data may need to be inferred from the existing data.
Missing Data Example
How should we handle this?
ID  Age  Sex  Edu.  Race  A    B  C  D
1   70   F    16    W     3.8  4  8  5
2   70   F    16    W     0.6  1  1  1
3   60   M    12    B     1.1  2  3  1
4   85   F    21    W     1.3  2  3  1
5   70   M    21    W     ?    2  4  3
Detecting Missing Data
Explicitly missing data
Match data specifications against data: are all the attributes present?
Scan individual records: are there gaps?
Rough checks: number of records, number of duplicates
Hidden damage to data
Missing values and defaults are indistinguishable: too many missing values? Metadata or domain expertise can help
Errors of omission: e.g. all call logs from a particular area are missing. Check whether data is missing randomly or intentionally
Sanity checks that a statistician usually performs, often by hand.
How to deal with missing data
Do nothing
Exclude subjects with missing values
Ignore the tuple
Expand the results from the sub-sample to the whole sample
Make a guess, replace with the guessed values
Fill in with simple imputation
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., unknown, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
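The constant, attribute-mean, and class-mean fills can be sketched in pandas (a tiny invented table):

```python
import pandas as pd

df = pd.DataFrame({"cls": ["x", "x", "y", "y"],
                   "income": [10.0, None, 30.0, None]})

df["fill_const"] = df["income"].fillna(-1)                  # global constant ("unknown")
df["fill_mean"]  = df["income"].fillna(df["income"].mean()) # attribute mean: 20
# attribute mean within the same class: 10 for x, 30 for y (smarter)
df["fill_cls"]   = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))
print(df)
```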
Fill in with better guessed values
Sophisticated Imputation
Propensity Score
Regression methods
predict the missing value from a regression equation with other variables as predictors
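As a minimal sketch (not from the slides), fit A on B using the four patients in the example table whose A is observed, then predict A for patient #5:

```python
import numpy as np

# Columns B (predictor) and A (target) for the patients with A observed
B = np.array([4.0, 1.0, 2.0, 2.0])
A = np.array([3.8, 0.6, 1.1, 1.3])

slope, intercept = np.polyfit(B, A, 1)    # least-squares fit A ~ slope*B + intercept
print(round(slope * 2.0 + intercept, 2))  # predicted A for patient #5 (B = 2)
```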
Missing Mechanism
Missing Completely At Random (MCAR)
The best scenario
Simple approaches can yield unbiased results
Missing At Random (MAR)
The less ideal scenario
More advanced approaches are necessary; can yield unbiased results
Not Missing At Random (NMAR)
The worst scenario
No approach can help; the results will always be biased
Missing Mechanism - cont.
How do we determine the missing mechanism?
Since missing information is not observed, we don't really know what the complete sample looks like, and thus we can't say for sure whether the missing data are MCAR, MAR, or NMAR
Can we guess the mechanism behind these missing values? Examples:
Gender missing because the responder forgets to fill in the information
Previous order number missing because some people don't like to answer some of the questions
Income missing because the patient's income is extremely high and he/she doesn't want to reveal it
Missing Mechanism - cont.
The probability of MCAR is often low.
MAR or NMAR? Most of the time we can't really determine which.
A common approach: unless a missing value is clearly NMAR (e.g. income), we assume MAR for the missing data
Major Challenges:
How to identify similar slices from which we can impute the missing value
How to impute the missing values
Simple Imputation
Standalone imputation
Mean, median, other point estimates
What is the assumption you make here?
That the distribution of the missing values is the same as that of the present values
Does not take into account inter-relationships
Convenient, easy to implement
Missing at random.
Imputation example
How to impute the missing A value for patient # 5?
Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
By age: (3.8+0.6)/2 = 2.2
By sex: 1.1
By education: 1.3
By race: (3.8 + 0.6 + 1.3)/3 = 1.9
By B: (1.1 + 1.3)/2 = 1.2
Who is/are in the same slice as #5?
ID  Age  Sex  Edu.  Race  A    B  C  D
1   70   F    16    W     3.8  4  8  5
2   70   F    16    W     0.6  1  1  1
3   60   M    12    B     1.1  2  3  1
4   85   F    21    W     1.3  2  3  1
5   70   M    21    W     ?    2  4  3
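The slice means above can be reproduced with pandas (the table is transcribed from the slide):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5], "Age": [70, 70, 60, 85, 70],
    "Sex": ["F", "F", "M", "F", "M"], "Edu": [16, 16, 12, 21, 21],
    "Race": ["W", "W", "B", "W", "W"],
    "A": [3.8, 0.6, 1.1, 1.3, None], "B": [4, 1, 2, 2, 2],
})

print(df["A"].mean())                        # sample mean: 1.7
print(df.loc[df["Age"] == 70, "A"].mean())   # by age: 2.2
print(df.loc[df["Sex"] == "M", "A"].mean())  # by sex: 1.1
print(df.loc[df["Race"] == "W", "A"].mean()) # by race: 1.9
print(df.loc[df["B"] == 2, "A"].mean())      # by B: 1.2
```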
Noisy Data
Noise:
random error in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
How to Handle Noisy Data?
Combined computer and human inspection
detect suspicious values, then have a human check and correct them
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Detecting Noisy Data
Compare estimates (averages, frequencies, medians) with expected values and bounds; check at various levels of granularity. Aggregates can be misleading.
Check for spikes and dips in distributions and histograms
Too many default values? Metadata or domain expertise can help
Sanity checks that a statistician usually performs, often by hand.
Simple Methods: Binning
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N.
The most straightforward
But outliers may dominate
Skewed data is not handled well
Equal-depth (frequency) partitioning:
It divides the range into N intervals, each containing approximately the same number of samples
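pandas offers both partitioning schemes directly; a sketch using the price data from the next slide:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10
print(pd.cut(prices, bins=3).value_counts())

# Equal-depth: 3 intervals with roughly the same number of samples (4 each)
print(pd.qcut(prices, q=3).value_counts())
```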
Binning Methods for Data Smoothing
* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2 (15-24): 15, 21, 21, 24
- Bin 3 (25-34): 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing (continued)
* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
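A short sketch of smoothing by bin means for the equal-depth case, reproducing the 9/23/29 values above (rounded to whole numbers as on the slide):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth bins of four values each, then replace each value
# by the mean of its bin
bins = pd.qcut(prices, q=3)
smoothed = prices.groupby(bins, observed=True).transform("mean").round()
print(smoothed.tolist())
# [9.0, 9.0, 9.0, 9.0, 23.0, 23.0, 23.0, 23.0, 29.0, 29.0, 29.0, 29.0]
```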
Outliers
Outlier: a departure from the expected behaviour
Data points inconsistent with the majority of data
Defining the expected is crucial in outlier detection
Outlier : Suspicious Data
Consider the data points
3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
92 is suspicious - an outlier
Often they are data glitches
OR
They could be a data miner's dream:
E.g. highly profitable customers
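One common way to make "suspicious" precise (a standard robust rule, not from the slides) is a z-score built on the median; 92 is the only point flagged:

```python
import numpy as np

x = np.array([3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92])

med = np.median(x)
mad = np.median(np.abs(x - med))      # median absolute deviation
robust_z = 0.6745 * (x - med) / mad   # comparable to a standard z-score
print(x[np.abs(robust_z) > 3.5])      # [92]
```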
Outlier Detection Methods
Model Based Approaches
Deviation-based Approaches
Model Fitting Based Approach
Models summarize general trends in data
more complex than simple aggregates
Eg. Regression
Data points that do not conform to well-fitting models are potential outliers
Do two things.
Assume a model to find outliers.
GOF (goodness of fit): but does the model really fit?
Deviation-based Approaches
General idea
Given a set of data points (local group or global set)
Outliers are points that do not fit the general characteristics of that set, i.e., the variance of the set is minimized when the outliers are removed
Basic assumption
Outliers are the outermost points of the data set
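A minimal sketch of the variance-minimization idea, reusing the suspicious-data example: flag the point whose removal reduces the variance of the set the most.

```python
import numpy as np

x = np.array([3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92])

# Variance of the set after dropping each point in turn
var_without = [np.delete(x, i).var() for i in range(len(x))]
print(x[int(np.argmin(var_without))])  # 92: dropping it minimizes the variance
```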
Censoring and Truncation
Censored: measurement is bounded but not precise, e.g. call durations > 200 are recorded as 200
Truncated: data point is dropped if it exceeds or falls below a certain bound, e.g. customers with less than 2 minutes of calling per month
Summary (so far)
Data Cleaning is a crucial pre-processing step for Data Analysis
Issues involved may range from simple formatting to outliers
Multiple approaches available for each problem
Select the one best matching the problem detected
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Calicut = Kozhikode
Detecting and resolving data value conflicts (Inconsistencies)
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g. miles vs. kilometres
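A toy sketch of schema integration and scale-conflict resolution in pandas; the key names and values are invented:

```python
import pandas as pd

# Source A records distance in miles; source B uses another key name and km
a = pd.DataFrame({"cust_id": [1, 2], "distance_miles": [10.0, 25.0]})
b = pd.DataFrame({"cust_no": [1, 2], "distance_km": [16.1, 40.2]})

b = b.rename(columns={"cust_no": "cust_id"})               # schema integration
b["distance_miles"] = (b["distance_km"] / 1.609).round(1)  # resolve scale conflict
merged = a.merge(b[["cust_id", "distance_miles"]],
                 on="cust_id", suffixes=("_a", "_b"))
print(merged)  # the two distance columns should now agree
```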
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different names in different databases
Derivable data: One attribute may be a derived attribute in another table, e.g., annual revenue
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies
Data Transformation
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]

$v' = \dfrac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$

E.g. let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$\dfrac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716$

Data Transformation: Normalization
Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):

$v' = \dfrac{v - \mu}{\sigma}$

E.g. let $\mu = 54{,}000$ and $\sigma = 16{,}000$. Then 73,600 becomes $\dfrac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
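Both normalizations in a few lines of pandas, reproducing the 0.716 and 1.225 results (the series is assumed to contain the range endpoints and 73,600):

```python
import pandas as pd

income = pd.Series([12_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax.round(3).tolist())   # [0.0, 0.716, 1.0]

# Z-score normalization with mu = 54,000 and sigma = 16,000 from the slide
z = (income - 54_000) / 16_000
print(z.round(3).tolist())        # [-2.625, 1.225, 2.75]
```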
Thank You