data tidying

LESSON 5: “Data Tidying in Practice”

• 5 key characteristics of “messy” data, or data that is not tidy
• 4 common times when data errors are likely to be introduced
• “Tidy” data is key to ensure error-free analysis
• Consistency and accuracy are hallmarks of tidy data
• A simple philosophy can address data quality concerns


Page 1: data tidying

7/25/2019 data tidying

http://slidepdf.com/reader/full/data-tidying 1/13



“Tidy” data is important to ensure error-free analysis

• Each variable you measure should be in one column
• Each different observation should be in a different row
• There should be one table for each type of variable
• If you have multiple tables, they should include a key in the tables that allows them to be linked

Source: Leek, “How To Share Data With A Statistician” (2013)
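
The four rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient and visit tables, their column names, and the helper `join_on_key` are hypothetical and do not come from the lesson. Each table holds one type of observation, each row is one observation, each column is one variable, and `patient_id` is the key that links the two tables.

```python
# Hypothetical example data (not from the lesson): two tidy tables.
# One table per type of observation, one row per observation,
# one column per variable, and a shared key column: patient_id.
patients = [
    {"patient_id": 1, "name": "Ana", "birth_year": 1980},
    {"patient_id": 2, "name": "Ben", "birth_year": 1975},
]
visits = [
    {"visit_id": 10, "patient_id": 1, "weight_kg": 61.5},
    {"visit_id": 11, "patient_id": 1, "weight_kg": 60.9},
    {"visit_id": 12, "patient_id": 2, "weight_kg": 82.0},
]


def join_on_key(left, right, key):
    """Attach the matching `right` row to each `left` row via `key`."""
    lookup = {row[key]: row for row in right}
    return [{**lookup[row[key]], **row} for row in left]


# Because both tables carry the key, they can be linked when needed.
visits_with_names = join_on_key(visits, patients, "patient_id")
```

Keeping patient details out of the visit table means a name or birth year is stored once, so it cannot become inconsistent across rows.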


Wickham identifies common characteristics of messy data

• Multiple variables are stored in one column
• Variables are stored in both rows and columns
• Column headers are values, not variable names
• Multiple types of observational units are stored in the same table
• A single observational unit is stored in multiple tables

Source: Wickham, “Tidy Data”, Journal of Statistical Software, August 2014, Volume 59, Issue 10
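
As a small illustration of the third characteristic (column headers are values, not variable names), the sketch below reshapes a messy table into a tidy one using plain Python. The table contents, the column names, and the helper `melt` are assumptions made up for this example; they are not taken from Wickham's paper.

```python
# Hypothetical messy table: the headers "2019" and "2020" are values
# of a "year" variable, not variable names.
messy = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column header into a value in its own row."""
    tidy_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], variable_name: header, value_name: value}
            )
    return tidy_rows


# Each output row is one observation: one country in one year.
tidy = melt(messy, "country", "year", "cases")
```

After reshaping, "year" is an ordinary column, so filtering or grouping by year no longer requires knowing which headers happen to exist.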


Follow these guidelines to turn messy data into tidy data

• Keep a close eye on your data when importing it
• Produce a new file at each step of the tidying process, with unique extensions (e.g., “_ORIGINAL”, “_COL-CLEAN”, “_ROW-CLEAN” or “_V1”, “_V2”, etc.)
• (In a table of numbers) Flush out errors by summing rows and columns and comparing results
• Understand where errors are commonly introduced and work proactively to limit mistakes
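
Two of these guidelines lend themselves to a short sketch. It assumes a table that carries recorded row and column totals, which is one possible reading of "summing rows and columns and comparing results". The function names, the file name, and the numbers are hypothetical.

```python
from pathlib import Path


def step_filename(original, suffix):
    """Name the file for one tidying step, e.g. survey_COL-CLEAN.csv."""
    path = Path(original)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


def find_total_mismatches(rows, row_totals, col_totals):
    """Return indices of rows and columns whose recomputed sum differs
    from the recorded total."""
    bad_rows = [i for i, row in enumerate(rows) if sum(row) != row_totals[i]]
    bad_cols = [j for j, col in enumerate(zip(*rows)) if sum(col) != col_totals[j]]
    return bad_rows, bad_cols


# Hypothetical table of counts with one typo: 50 was entered instead of 5.
rows = [[1, 2, 3], [4, 50, 6]]
row_totals = [6, 15]    # totals recorded in the source
col_totals = [5, 7, 9]  # totals recorded in the source

bad_rows, bad_cols = find_total_mismatches(rows, row_totals, col_totals)
next_file = step_filename("survey.csv", "_COL-CLEAN")
```

A mismatch in both a row and a column points to the cell where they cross, which is usually the first place to look for the error.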


Understand common data errors to identify trouble spots

• Measurement errors (“bias”), when data collected from a sample is not reflective of the population
• Data entry errors, when data is incorrectly recorded
• Data integration errors, when data imported or joined from multiple databases does not integrate smoothly
• Calculation errors, when inaccurate formulas or manipulations produce unintended outputs
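
Integration errors in particular can often be caught before a join by comparing the key values on each side. The sketch below is a minimal example under assumed data: the tables, the `customer_id` key, and the helper `unmatched_keys` are hypothetical.

```python
def unmatched_keys(left, right, key):
    """Return key values found only in `left` and only in `right`."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys - right_keys), sorted(right_keys - left_keys)


# Hypothetical tables from two databases. One order uses a lowercase
# key ("c3"), a typical data entry slip that breaks the join.
orders = [
    {"customer_id": "C1", "amount": 20},
    {"customer_id": "C2", "amount": 35},
    {"customer_id": "c3", "amount": 15},
]
customers = [
    {"customer_id": "C1"},
    {"customer_id": "C2"},
    {"customer_id": "C3"},
]

only_in_orders, only_in_customers = unmatched_keys(orders, customers, "customer_id")
```

Any value in either list marks rows that would be dropped or left unmatched by a join, so it is worth resolving them first.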


Kaushik offers a plain philosophy on data quality

• Assume those managing the data have a level of comfort with the data (i.e., trust them)
• Start making decisions that you are comfortable with
• Over time drill deeper in micro specific areas and learn
• Get more comfortable with data and its limitations over time
• Consistency in calculations = Good

Source: Kaushik, “Data Quality Sucks, Let's Just Get Over It” (2006)


Supplemental reading for this lesson

• Data Quality Sucks, Let’s Just Get Over It: http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/
• Tidy Data: http://www.jstatsoft.org/v59/i10/paper


References

1. Jeff Leek. 2013. “How to Share Data with a Statistician”. Retrieved from https://github.com/jtleek/datasharing
2. Hadley Wickham. 2014. “Tidy Data”. The Journal of Statistical Software, Vol. 59, Issue 10, Sep 2014. Retrieved from www.jstatsoft.org/v59/i10/paper
3. Avinash Kaushik. 2006. “Data Quality Sucks, Let's Just Get Over It”. Retrieved from http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/