Lisa Almond FMS 2015. CFR 2014/15 Spring 2015 Tidying FMS users FMS Contents.
data tidying
7/25/2019 data tidying
http://slidepdf.com/reader/full/data-tidying 1/13
LESSON 5:
“Data Tidying in Practice”
5 key characteristics of “messy” data, or data
that is not tidy
4 common times when data errors are likely
to be introduced
“Tidy” data is key to ensuring error-free
analysis
Consistency and accuracy are hallmarks of tidy data
“Tidy” data is important to ensure error-free analysis
• Each variable you measure should be in one column
• Each different observation should be in a different row
• There should be one table for each type of variable
• If you have multiple tables, they should include a key in
each table that allows them to be linked
Source: Leek, “How To Share Data With A Statistician” (2013)
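
The last guideline (a key that links multiple tables) can be sketched as follows. This is a minimal illustration using only the Python standard library; the table names, column names, values, and the `link_tables` helper are all hypothetical and do not come from the lesson.

```python
# Hypothetical example data: two tidy tables, one per type of variable.
# Each row is one observation, each field is one variable, and
# "patient_id" is the key that appears in both tables.
patients = [
    {"patient_id": 1, "name": "A. Rivera", "birth_year": 1980},
    {"patient_id": 2, "name": "B. Chen", "birth_year": 1975},
]
visits = [
    {"visit_id": 10, "patient_id": 1, "systolic_bp": 120},
    {"visit_id": 11, "patient_id": 1, "systolic_bp": 118},
    {"visit_id": 12, "patient_id": 2, "systolic_bp": 135},
]


def link_tables(left, right, key):
    """Attach the matching `left` row to every `right` row via `key`.

    Assumes `key` is unique in `left`; raises KeyError if a `right`
    row refers to a key that `left` does not contain.
    """
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right]


linked = link_tables(patients, visits, "patient_id")
```

Keeping patient details and visit measurements in separate tables avoids repeating the name and birth year on every visit row, and the shared key is what makes the tables recombinable when an analysis needs both.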
[Figure: example of “tidy” data]
[Figure: example of “messy” data]
Wickham identifies common characteristics
of messy data
• Multiple variables are stored in one column
• Variables are stored in both rows and columns
• Column headers are values, not variable names
• Multiple types of observational units are stored in the
same table
• A single observational unit is stored in multiple tables
Source: Wickham, “Tidy Data,” Journal of Statistical Software, August 2014, Volume 59, Issue 10
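
One of these problems, column headers that are values rather than variable names, can be repaired by reshaping the table from wide to long. The sketch below uses only the Python standard library; the `melt` helper, the column names, and the sales figures are hypothetical illustrations, not material from the lesson or from Wickham's paper.

```python
def melt(rows, id_fields, var_name, value_name):
    """Turn value-bearing column headers into a variable column.

    Every field not listed in `id_fields` is treated as a header that is
    really a value; it becomes one output row with the header stored
    under `var_name` and the cell stored under `value_name`.
    """
    tidy = []
    for row in rows:
        ids = {field: row[field] for field in id_fields}
        for column, cell in row.items():
            if column in id_fields:
                continue
            tidy.append({**ids, var_name: column, value_name: cell})
    return tidy


# Hypothetical messy table: the headers "2013" and "2014" are values of
# a "year" variable, not variable names.
messy = [
    {"region": "North", "2013": 10, "2014": 12},
    {"region": "South", "2013": 7, "2014": 9},
]

tidy = melt(messy, ["region"], "year", "sales")
```

After reshaping, each row holds one region-year observation, so the table has the three variables region, year, and sales, each in its own column.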
Follow these guidelines to turn messy
data into tidy data
• Keep a close eye on your data when importing it
• Produce a new file at each step of the tidying process, with
unique extensions (e.g., “_ORIGINAL”, “_COL-CLEAN”,
“_ROW-CLEAN” or “_V1”, “_V2”, etc.)
• (In a table of numbers) Flush out errors by summing rows
and columns and comparing results
• Understand where errors are commonly introduced and
proactively work to limit mistakes
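
Two of these guidelines lend themselves to a short sketch: naming a new file for each tidying step, and summing rows and columns to flush out errors. The code below is one possible reading, using only the Python standard library. It assumes the “extensions” are inserted before the real file extension and that the table comes with separately reported row and column totals to compare against; the function names and sample numbers are hypothetical.

```python
from pathlib import Path


def step_filename(original, suffix):
    """Build a per-step filename, e.g. survey.csv -> survey_COL-CLEAN.csv."""
    path = Path(original)
    return path.with_name(f"{path.stem}_{suffix}{path.suffix}")


def find_total_mismatches(body, reported_row_totals, reported_col_totals):
    """Return indices of rows and columns whose sums differ from the
    reported totals. Uses exact comparison, so it suits integer counts;
    floating-point data would need a tolerance instead.
    """
    bad_rows = [
        i for i, row in enumerate(body) if sum(row) != reported_row_totals[i]
    ]
    bad_cols = [
        j for j, col in enumerate(zip(*body)) if sum(col) != reported_col_totals[j]
    ]
    return bad_rows, bad_cols


# Hypothetical table in which the last cell was mistyped as 60 instead of 6.
body = [
    [1, 2, 3],
    [4, 5, 60],
]
reported_row_totals = [6, 15]
reported_col_totals = [5, 7, 9]

bad_rows, bad_cols = find_total_mismatches(
    body, reported_row_totals, reported_col_totals
)
clean_name = step_filename("survey.csv", "COL-CLEAN")
```

A single bad cell shows up as one mismatched row and one mismatched column, and their intersection points at the cell to inspect. Writing each step to its own file means the earlier version is still available if a correction turns out to be wrong.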
Understand common data errors to
identify trouble spots
• Measurement errors (“bias”), when data collected from a
sample is not reflective of the population
• Data entry errors, when data is incorrectly
recorded
• Data integration errors, when data imported or joined from
multiple databases does not integrate smoothly
• Calculation errors, when inaccurate formulas or
manipulations produce unintended outputs
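
Data integration errors in particular can often be caught before a join by checking which key values appear in only one of the two sources. The following is a minimal sketch using only the Python standard library; the `unmatched_keys` helper and the customer and order records are hypothetical.

```python
def unmatched_keys(left, right, key):
    """Return (keys only in left, keys only in right), each sorted.

    Rows with unmatched keys are the ones a join would silently drop
    or leave with missing values.
    """
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys - right_keys), sorted(right_keys - left_keys)


# Hypothetical records exported from two different databases.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [{"customer_id": 2}, {"customer_id": 3}, {"customer_id": 4}]

only_in_customers, only_in_orders = unmatched_keys(
    customers, orders, "customer_id"
)
```

Here customer 1 has no orders, which may be legitimate, while order data refers to a customer 4 that the customer table does not contain, which is the kind of mismatch worth investigating before the tables are combined.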
Kaushik offers a plain philosophy on data quality
• Assume those managing the data have a level of comfort
with the data (i.e., trust them)
• Start making decisions that you are comfortable with
• Over time drill deeper in micro specific areas and learn
• Get more comfortable with data and its limitations over time
• Consistency in calculations = Good
Source: Kaushik, “Data Quality Sucks, Let's Just Get Over It” (2006)
Supplemental reading for this lesson
• Data Quality Sucks, Let’s Just Get Over It: http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/
• Tidy Data: http://www.jstatsoft.org/v59/i10/paper
References
1. Jeff Leek. 2013. “How to Share Data with a Statistician”. Retrieved from https://github.com/jtleek/datasharing
2. Hadley Wickham. 2014. “Tidy Data”. Journal of Statistical Software, Vol. 59, Issue 10. Retrieved from www.jstatsoft.org/v59/i10/paper
3. Avinash Kaushik. 2006. “Data Quality Sucks, Let's Just Get Over It”. Retrieved from http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/