ESA Ignite talk on quality control for data
Ensuring That Your Data are High Quality
Carly Strasser | @carlystrasser California Digital Library
ESA 2014
Quality assurance & control: Mechanisms for preventing errors from entering a data set
Quality assurance & control in the data life cycle:
Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
Why?
4 Strategies:
• Prevent
• Minimize
• Detect
• Handle
From Flickr by Elliott Teel
Prevent Errors Before Collection
• Define & enforce standards
  • Formats
  • Codes
  • Measurement units
  • Metadata
• Assign responsibility for data quality
From Flickr by StacieBee
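One way to enforce a standard is to write it down as data and check every record against it. The sketch below is only an illustration: the field names, site codes, date format, and unit are hypothetical and stand in for whatever a project actually agrees on.

```python
from datetime import datetime

# Hypothetical project standard; every value here is an assumption for illustration.
STANDARD = {
    "date_format": "%Y-%m-%d",         # format: year-month-day dates
    "site_codes": {"A1", "A2", "B1"},  # codes: the agreed site identifiers
    "temp_unit": "C",                  # measurement unit: degrees Celsius
}

def check_record(record, standard=STANDARD):
    """Return a list of problems found in one record (empty if it conforms)."""
    problems = []
    try:
        datetime.strptime(record.get("date", ""), standard["date_format"])
    except ValueError:
        problems.append("date is not in the agreed format")
    if record.get("site") not in standard["site_codes"]:
        problems.append("unknown site code")
    if record.get("temp_unit") != standard["temp_unit"]:
        problems.append("unexpected temperature unit")
    return problems

good = {"date": "2014-08-12", "site": "A1", "temp_unit": "C"}
bad = {"date": "12/08/2014", "site": "a1", "temp_unit": "F"}
print(check_record(good))
print(check_record(bad))
```

Keeping the standard in one place means the same definition can be shared with everyone who collects data and reused in later checks.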
Prevent Errors Before Collection
• Comments & notes fields
• Allow “other” values
Allows handling of unexpected situations
From Flickr by Olga Nohra
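A small sketch of how an "other" value and a notes field might work together. The species codes are hypothetical, and the rule that "other" must come with a note is an assumption added for illustration, not something stated in the talk.

```python
# Hypothetical code list; "other" is the escape hatch for unexpected cases.
ALLOWED_SPECIES = {"quercus_agrifolia", "pinus_radiata", "other"}

def validate_entry(entry):
    """Return None if the entry is acceptable, else a message describing the problem."""
    code = entry.get("species")
    notes = (entry.get("notes") or "").strip()
    if code not in ALLOWED_SPECIES:
        return "species code not in the agreed list; use 'other' plus a note"
    if code == "other" and not notes:
        # Assumed rule: an "other" value is only useful if someone explains it.
        return "'other' requires an explanation in the notes field"
    return None

print(validate_entry({"species": "other", "notes": "seedling, could not identify"}))
print(validate_entry({"species": "other", "notes": ""}))
```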
Minimize Errors During Collection
• Eliminate manual data entry
• Design data storage well
• Minimize repeat entry
• Use consistent terminology
• Atomize data
From Flickr by Butal Lee
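To illustrate "atomize data" and "use consistent terminology", here is a minimal sketch that splits a combined entry such as "12.5 cm" into a value and a unit, and maps free-text variants onto one agreed term. The synonym table and function names are hypothetical.

```python
import re

# Hypothetical synonym table mapping free-text variants to one agreed term.
TERMS = {"m": "male", "male": "male", "f": "female", "female": "female"}

def atomize_measurement(text):
    """Split a combined entry such as '12.5 cm' into (value, unit)."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse measurement: {text!r}")
    return float(match.group(1)), match.group(2)

def normalize_term(text):
    """Map a variant spelling onto the agreed term; raises KeyError if unknown."""
    return TERMS[text.strip().lower()]

print(atomize_measurement("12.5 cm"))
print(normalize_term(" M "))
```

Storing the value and the unit in separate columns keeps each cell to one piece of information, which makes later sorting and validation simpler.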
You should invest time in learning databases if
• your data sets are large or complex
Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old
From Mark Schildhauer
Minimize Errors: Use databases
Minimize Errors: Tools
Databases
• FileMaker Pro (Mac)
• Access (PC)
• LibreOffice
Spreadsheets
• Google forms
• LibreOffice
• Lists & data validation in Excel
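The tools listed above can all reject invalid entries at the moment they are typed. As a rough sketch of the same idea, the example below uses SQLite through Python's standard sqlite3 module, which is a swapped-in stand-in and not one of the tools named in the talk. The table, column names, and allowed ranges are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.execute("""
    CREATE TABLE observation (
        id     INTEGER PRIMARY KEY,
        site   TEXT NOT NULL CHECK (site IN ('A1', 'A2', 'B1')),
        temp_c REAL CHECK (temp_c BETWEEN -50 AND 60),
        notes  TEXT
    )
""")

def try_insert(site, temp_c, notes=None):
    """Insert one row; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO observation (site, temp_c, notes) VALUES (?, ?, ?)",
                (site, temp_c, notes),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert("A1", 18.5))   # valid site and temperature
print(try_insert("Z9", 18.5))   # site code not in the allowed list
print(try_insert("A1", 999.0))  # temperature outside the allowed range
```

The constraints play the same role as a drop-down list or a data validation rule in a spreadsheet: the bad value never enters the data set.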
Detect Errors After Collection
Look for outliers
Goal is not to eliminate outliers but to identify potential data contamination
Strategies
• Normal probability plots
• Regression
• Scatter plots
• Maps
[Figure: example plot; horizontal axis 0 to 40, vertical axis 0 to 60]
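As a sketch of the regression strategy, the example below fits a straight line by least squares and flags points whose residual is large compared with the spread of all residuals. The threshold of 2 standard deviations and the sample data are assumptions chosen for illustration. Flagged points are candidates for a closer look, not for automatic deletion.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_outliers(xs, ys, threshold=2.0):
    """Return indices of points whose residual exceeds threshold * residual stdev."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.stdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the fitted line
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Made-up data: y = 2x + 1, with one suspicious value at index 5.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 60
print(flag_outliers(xs, ys))
```

A single extreme point also pulls the fitted line toward itself, so with very small data sets a scatter plot is a useful companion check.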
Handle Errors
• Case-by-case decision
  • Flag them?
  • Remove them?
  • Fix them?
• Document all changes (readme.txt, scripts)
• Keep original data separate
• Use scripts
Raw data as .csv
R script for QAQC
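The talk suggests an R script for this step; the sketch below shows the same workflow in Python instead. It reads the raw .csv without modifying it, writes a separate checked copy with an added flag column, and logs every flag to a text file. The column name temp_c, the valid range, and the file names are hypothetical.

```python
import csv

def run_qaqc(raw_path, checked_path, log_path, low=-50.0, high=60.0):
    """Copy raw_path to checked_path with a 'qc_flag' column; log each flag.

    The raw file is only opened for reading, so the original data stay intact.
    Returns the number of flagged rows.
    """
    flagged = 0
    with open(raw_path, newline="") as raw, \
         open(checked_path, "w", newline="") as out, \
         open(log_path, "w") as log:
        reader = csv.DictReader(raw)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["qc_flag"])
        writer.writeheader()
        # Line numbers assume one record per line, with the header on line 1.
        for line_no, row in enumerate(reader, start=2):
            flag = ""
            try:
                value = float(row["temp_c"])
                if not low <= value <= high:
                    flag = "out_of_range"
            except (TypeError, ValueError):
                flag = "not_numeric"
            if flag:
                flagged += 1
                log.write(f"line {line_no}: temp_c={row['temp_c']!r} flagged {flag}\n")
            writer.writerow({**row, "qc_flag": flag})
    return flagged
```

A call such as `run_qaqc("raw.csv", "checked.csv", "readme.txt")` flags values instead of removing or fixing them, which leaves the case-by-case decision to a person while the script and log document what was done.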
4 Strategies:
• Prevent
• Minimize
• Detect
• Handle
From Flickr by Elliott Teel
Website: carlystrasser.net
Email: [email protected]
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser