Introduction to data pre-processing and cleaning

Data Preparation and Cleaning, February 22, 2016, Matteo Manca


Page 1: Introduction to data pre-processing and cleaning


Page 2: Introduction to data pre-processing and cleaning

Matteo Manca

Researcher @ Eurecat (Social Media group), BCN

PhD @ Cagliari, Italy

Research interests:
• social media mining
• social network analysis
• computational social science
• data science

Contacts: https://mattemanca.wordpress.com

Page 3: Introduction to data pre-processing and cleaning

Chapter index

Big data: topics

• Topic 1: Big Data Economy
• Topic 2: Environment
• Topic 3: Data Exploration
• Topic 4: Data Ingestion & Storage
• Topic 5: Data Preparation - Cleaning
• Topic 6: Distributed Systems (Hadoop)
• Topic 7: Distributed Analytics (PIG)


Page 4: Introduction to data pre-processing and cleaning

Data Preparation - Cleaning

• Why are we interested in data preparation and cleaning?
• Introduction to data pre-processing and cleaning (main concepts and main steps)
• Best practices
• Data pre-processing and cleaning in R: step-by-step tutorial


Page 5: Introduction to data pre-processing and cleaning

Why are we interested in data pre-processing and cleaning?

Let’s analyse our data!

1. Average test score?
2. Most common year?
3. % of male and female?

(Figure: raw data)


Page 6: Introduction to data pre-processing and cleaning

Why are we interested in data pre-processing and cleaning?

Raw data is often:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

As a consequence:
• Data analysts spend much, if not most, of their time preparing the data before doing the analysis
• A commonly quoted estimate is that 80% of data mining and analysis is really data preparation
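The three kinds of raw-data problems listed above can be made concrete with a small sketch. It is written in Python purely for illustration (the tutorial part of this deck uses R), and everything in it is an assumption: the sample records, the field names, the 0 to 100 score range and the "F"/"M" coding are all hypothetical and do not come from the slides.

```python
# Hypothetical raw records (not from the slides): name, gender, year, test score.
raw_rows = [
    {"name": "Ann",  "gender": "F",      "year": "2014", "score": "78"},
    {"name": "Bob",  "gender": "male",   "year": "2014", "score": ""},
    {"name": "Cleo", "gender": "Female", "year": "2015", "score": "870"},
    {"name": "Dan",  "gender": "M",      "year": "2015", "score": "64"},
]

# Assumed conventions for this sketch: scores lie in 0..100, gender is coded "F" or "M".
VALID_GENDER_CODES = {"F", "M"}


def find_problems(rows):
    """Return a list of (row_index, problem_kind, field) tuples."""
    problems = []
    for i, row in enumerate(rows):
        # Incomplete: an attribute value is missing.
        for field, value in row.items():
            if value.strip() == "":
                problems.append((i, "incomplete", field))
        # Noisy: a score outside the plausible range is an error or an outlier.
        score = row["score"].strip()
        if score and not 0 <= float(score) <= 100:
            problems.append((i, "noisy", "score"))
        # Inconsistent: the same category written with different codes.
        if row["gender"] not in VALID_GENDER_CODES:
            problems.append((i, "inconsistent", "gender"))
    return problems


problems = find_problems(raw_rows)
for problem in problems:
    print(problem)
```

Computing an average score or a gender percentage directly on such rows would either fail (the empty score) or give a misleading answer (the score of 870, the four different gender spellings), which is why the cleaning steps come first.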

Page 7: Introduction to data pre-processing and cleaning

Data Pre-processing and Cleaning

The process of transforming raw data into consistent data that can be analyzed.

Consistent data is the stage at which the data is ready for analysis.

Main steps:
• Handle missing values (ignore the tuple, fill the missing value with the mean/mode value, predict it, etc.)
• Identify or remove outliers
• Resolve inconsistencies
• Data transformation: normalization and aggregation

(Diagram: raw data → data pre-processing and cleaning → consistent data)
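The main steps above can be sketched in a few lines. This is a minimal illustration in Python, not the R tutorial from the course. The sample values, the function names, the median-based outlier rule and the min-max normalization are all assumptions chosen for the example. Outliers are handled before the missing values here so that an extreme value does not distort the mean used for filling; that ordering is also a choice of this sketch.

```python
from statistics import mean, median

# Hypothetical data: None marks a missing score.
scores = [78.0, None, 64.0, 870.0, 71.0, 69.0]
genders = ["F", "male", "Female", "M", "f", "M"]

# Assumed mapping from the spellings found in the data to one code per category.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def mark_outliers_missing(values, k=5.0):
    """Replace values further than k median absolute deviations from the median with None."""
    observed = [v for v in values if v is not None]
    center = median(observed)
    spread = median([abs(v - center) for v in observed])
    return [None if v is not None and abs(v - center) > k * spread else v for v in values]


def fill_missing_with_mean(values):
    """Fill every missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def harmonise_gender(labels):
    """Resolve inconsistent spellings into a single code per category."""
    return [GENDER_CODES[label.strip().lower()] for label in labels]


def min_max_normalise(values):
    """Transformation: rescale values linearly to the range 0..1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def mean_by_group(groups, values):
    """Transformation: aggregate values into one mean per group."""
    grouped = {}
    for group, value in zip(groups, values):
        grouped.setdefault(group, []).append(value)
    return {group: mean(members) for group, members in grouped.items()}


cleaned = fill_missing_with_mean(mark_outliers_missing(scores))
clean_genders = harmonise_gender(genders)
normalised = min_max_normalise(cleaned)
mean_per_gender = mean_by_group(clean_genders, cleaned)
print(cleaned, clean_genders, normalised, mean_per_gender, sep="\n")
```

Filling with the mean is only one of the options named on the slide; ignoring the tuple or predicting the value would replace `fill_missing_with_mean` with a different function while leaving the other steps unchanged.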


Page 8: Introduction to data pre-processing and cleaning

Data Pre-processing and Cleaning

Consistent data
• Each variable you measure should be in one column
• Each different observation (record) should be in a different row
• If we are working with different kinds of variables, each kind should have its own data frame, and the data frames should be linked to each other (for example through a shared key column)
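The rules for consistent data above can be illustrated by reshaping a small table. This is a sketch in Python under stated assumptions: the "wide" input table, the column names and the `student_id` key are hypothetical and were invented for the example.

```python
# Hypothetical "wide" table: the variable "year" is hidden inside the column names.
wide_rows = [
    {"student_id": 1, "name": "Ann", "score_2014": 78, "score_2015": 81},
    {"student_id": 2, "name": "Bob", "score_2014": 64, "score_2015": 70},
]

# Table 1: one row per student, holding the variables that describe a student.
students = [{"student_id": r["student_id"], "name": r["name"]} for r in wide_rows]

# Table 2: one row per observation (one score of one student in one year),
# with one column per variable: student_id, year, score.
scores = []
for r in wide_rows:
    for column, value in r.items():
        if column.startswith("score_"):
            scores.append({
                "student_id": r["student_id"],
                "year": int(column.split("_", 1)[1]),
                "score": value,
            })

# The two tables stay linked through the shared key column "student_id".
names_by_id = {s["student_id"]: s["name"] for s in students}
for row in scores:
    print(names_by_id[row["student_id"]], row["year"], row["score"])
```

With the data in this shape, a question such as "most common year" becomes a simple count over the `year` column instead of an inspection of column names.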


Page 9: Introduction to data pre-processing and cleaning

Data Pre-processing and Cleaning

Best practices
• Pipeline: an explicit “recipe” used to go from step i to step i+1 (all steps should be recorded)
• A code book that describes each variable and its values in the tidy dataset
• Make variable names human-readable
• Save your clean / consistent data to files, to avoid repeating the pre-processing and data cleaning (DC) each time (one file per data frame / table)
• Markdown (.md) files are usually used to write this documentation (https://en.wikipedia.org/wiki/Markdown)
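A pipeline in the sense of the best practices above might look like the following sketch. It is written in Python for illustration only, and every name in it is an assumption: the cryptic raw column names `nm` and `sc`, the readable names they are mapped to, the step functions and the output file name are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical raw rows with cryptic column names and untidy values.
raw_rows = [
    {"nm": " Ann ", "sc": "78"},
    {"nm": "Bob", "sc": " "},
    {"nm": "Cleo", "sc": "64 "},
]

# Assumed mapping from raw column names to human-readable ones.
COLUMN_NAMES = {"nm": "student_name", "sc": "test_score"}


def strip_whitespace(rows):
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_rows_without_score(rows):
    return [row for row in rows if row["sc"] != ""]


def rename_columns(rows):
    return [{COLUMN_NAMES[key]: value for key, value in row.items()} for row in rows]


def run_pipeline(rows, steps):
    """Apply the steps in order and record what each one did."""
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append(f"{step.__name__}: {before} -> {len(rows)} rows")
    return rows, log


def save_table(rows, path):
    """Write one table to one CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# The recipe: an explicit, ordered list of steps.
steps = [strip_whitespace, drop_rows_without_score, rename_columns]
clean_rows, log = run_pipeline(raw_rows, steps)

# Save the clean table so later analyses can start from the file.
clean_path = os.path.join(tempfile.mkdtemp(), "students_clean.csv")
save_table(clean_rows, clean_path)

with open(clean_path, newline="", encoding="utf-8") as handle:
    reloaded = list(csv.DictReader(handle))

print("\n".join(log))
```

The recorded log lines are the kind of content that could be copied into the Markdown documentation next to the code book.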


Page 10: Introduction to data pre-processing and cleaning

Data Pre-processing and Cleaning in R

R is a free software environment for statistical computing and graphics (https://www.r-project.org).

RStudio is a user interface for R (https://www.rstudio.com).

Page 11: Introduction to data pre-processing and cleaning

Questions?


Page 13: Introduction to data pre-processing and cleaning

Data Pre-processing and Cleaning

© 2015, Barcelona Technology School (www.barcelonatechnologyschoo.com)