ESA Ignite talk on quality control for data

Post on 10-May-2015

Talk on quality control and quality assurance for ecological data, presented as an Ignite talk at the ESA 2014 meeting in Sacramento, CA, on 12 Aug 2014.

Transcript of ESA Ignite talk on quality control for data

Ensuring That Your Data are High Quality

Carly Strasser | @carlystrasser | California Digital Library

ESA 2014

Quality assurance & control: Mechanisms for preventing errors from entering a data set

Quality assurance & control in the data life cycle:

Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze

Why?

From:

4 Strategies:

•  Prevent

•  Minimize

•  Detect

•  Handle

From Flickr by Elliott Teel

Prevent Errors Before Collection

•  Define & enforce standards

    •  Formats

    •  Codes

    •  Measurement units

    •  Metadata

•  Assign responsibility for data quality

From Flickr by StacieBee
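
One way to enforce standards like these is to check each record against them before it is accepted. The following is a minimal sketch, not something shown in the talk: the field names, species codes, date format, and unit (centimetres) are all hypothetical choices made for illustration.

```python
from datetime import datetime

# Hypothetical standards for an illustrative survey; none of these
# names, codes, or units come from the talk.
REQUIRED_FIELDS = ("date", "species_code", "height_cm")
ALLOWED_SPECIES_CODES = {"QUAG", "QULO", "PISA"}
DATE_FORMAT = "%Y-%m-%d"


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Format standard: dates are written year-month-day.
    date = record.get("date")
    if date:
        try:
            datetime.strptime(date, DATE_FORMAT)
        except ValueError:
            problems.append(f"date not in YYYY-MM-DD format: {date!r}")

    # Code standard: only agreed species codes are accepted.
    code = record.get("species_code")
    if code and code not in ALLOWED_SPECIES_CODES:
        problems.append(f"unknown species code: {code!r}")

    # Unit standard: height is a positive number of centimetres.
    height = record.get("height_cm")
    if height not in (None, ""):
        try:
            if float(height) <= 0:
                problems.append(f"height_cm must be positive: {height!r}")
        except (TypeError, ValueError):
            problems.append(f"height_cm is not a number: {height!r}")

    return problems


good = check_record({"date": "2014-08-12", "species_code": "QUAG", "height_cm": "152.5"})
bad = check_record({"date": "12/08/2014", "species_code": "oak", "height_cm": "tall"})
```

Here `good` comes back empty, while `bad` lists one problem each for the date, the species code, and the height.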

Prevent Errors Before Collection

•  Allow “other” values

•  Comments & notes fields

Allows handling of unexpected situations

From Flickr by Olga Nohra
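
As a rough illustration of an “other” value paired with a notes field, the sketch below stores an unexpected entry under the code "other" and keeps the recorder's original wording in the notes. The habitat codes and the function name are hypothetical, not from the talk.

```python
# Hypothetical controlled vocabulary; "other" is the escape hatch.
HABITAT_CODES = {"forest", "grassland", "wetland", "other"}


def make_observation(habitat, notes=""):
    """Build a record, routing unexpected habitats to 'other' plus a note."""
    entered = habitat.strip()
    code = entered.lower()
    if code not in HABITAT_CODES:
        # Keep what was actually written instead of forcing a wrong code.
        notes = f"habitat entered as {entered!r}; {notes}".rstrip("; ")
        code = "other"
    return {"habitat": code, "notes": notes}


expected = make_observation("Forest")
unexpected = make_observation("Burned chaparral", "fire two years earlier")
```

The second record ends up with the habitat code "other", and its notes still carry the phrase the recorder wrote.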

Minimize Errors During Collection

•  Eliminate manual data entry

•  Design data storage well

•  Minimize repeat entry

•  Use consistent terminology

•  Atomize data

From Flickr by Butal Lee
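
To make "atomize data" concrete: the idea is that each cell holds one piece of information. The sketch below splits a combined free-text entry into separate fields. The entry layout (genus, species, value, unit, separated by spaces) and the field names are assumptions made for this example only.

```python
def atomize(entry):
    """Split a combined entry such as 'Quercus agrifolia 152.5 cm' into fields.

    Assumes exactly four whitespace-separated parts; anything else raises
    ValueError so that malformed entries are noticed rather than guessed at.
    """
    genus, species, value, unit = entry.split()
    return {
        "genus": genus,
        "species": species,
        "height": float(value),
        "unit": unit,
    }


row = atomize("Quercus agrifolia 152.5 cm")
```

After this, `row["height"]` is a number that can be sorted or averaged, and the unit sits in its own column instead of being buried in text.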

Minimize Errors: Use databases

You should invest time in learning databases if your data sets are large or complex

Consider investing time in learning databases if

•  your data are small and humble

•  you ever intend to share your data

•  you are < 30 years old

From Mark Schildhauer

Minimize Errors: Tools

Databases

•  FileMaker Pro (Mac)

•  Access (PC)

•  LibreOffice

Spreadsheets

•  Google forms

•  LibreOffice

•  Lists & data validation in Excel
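
The tools listed here are graphical, but the underlying idea (the storage layer itself rejects invalid entries) can be sketched in a few lines. The example below swaps in SQLite through Python's standard `sqlite3` module, which is not one of the tools named in the talk. The table, column names, and allowed codes are hypothetical.

```python
import sqlite3

# In-memory database for illustration; table and codes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation (
        id           INTEGER PRIMARY KEY,
        obs_date     TEXT NOT NULL,
        species_code TEXT NOT NULL
                     CHECK (species_code IN ('QUAG', 'QULO', 'PISA')),
        height_cm    REAL CHECK (height_cm > 0)
    )
    """
)


def try_insert(obs_date, species_code, height_cm):
    """Insert one row; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO observation (obs_date, species_code, height_cm) "
                "VALUES (?, ?, ?)",
                (obs_date, species_code, height_cm),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("2014-08-12", "QUAG", 152.5)
bad_code = try_insert("2014-08-12", "oak", 152.5)
bad_height = try_insert("2014-08-12", "QUAG", -3.0)
row_count = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
```

Only the first row is stored. The `CHECK` constraints play roughly the role that lists and data validation play in a spreadsheet.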

Detect Errors After Collection

Look for outliers

Goal is not to eliminate outliers but to identify potential data contamination

Strategies

•  Normal probability plots

•  Regression

•  Scatter plots

•  Maps

[Figure: example plot; one axis runs from 0 to 60, the other from 0 to 40]
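
As one possible take on the regression strategy, the sketch below fits a straight line by ordinary least squares and flags points whose residual is large relative to the spread of all residuals. Points are flagged for inspection, not removed, in keeping with the stated goal. The data values and the threshold of 2 standard deviations are arbitrary choices for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(xs, ys, threshold=2.0):
    """Return one True/False flag per point; True marks a suspicious residual."""
    mx, my = mean(xs), mean(ys)

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx

    # Residual = observed value minus fitted value.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)

    # Flag, do not delete: a person decides what each flag means.
    return [spread > 0 and abs(r) > threshold * spread for r in residuals]


# Made-up data on the line y = 1.5x, except the point at x = 20.
xs = [0, 5, 10, 15, 20, 25, 30, 35, 40]
ys = [0.0, 7.5, 15.0, 22.5, 55.0, 37.5, 45.0, 52.5, 60.0]
flags = flag_outliers(xs, ys)
```

With these values only the point at x = 20 is flagged. A single extreme value also inflates the residual spread, so a more robust measure may be preferable with real data.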

Handle Errors

•  Case-by-case decision

    •  Flag them?

    •  Remove them?

    •  Fix them?

•  Document all changes: readme.txt, scripts

•  Keep original data separate

•  Use scripts

Raw data as .csv

R script for QAQC
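
The slide pairs raw data stored as .csv with an R script for QAQC. The sketch below shows the same workflow in Python instead of R: the raw data are only read, a separate copy is written with an added flag column, and every flag is logged. The column names, the range limit, and the flag labels are hypothetical, and in-memory text buffers stand in for real files.

```python
import csv
import io


def run_qaqc(raw_file, clean_file, log_file, max_height_cm=10000.0):
    """Copy raw CSV rows to clean_file with a qc_flag column; log each flag.

    raw_file is only read, so the original data stay untouched.
    """
    reader = csv.DictReader(raw_file)
    writer = csv.DictWriter(clean_file, fieldnames=reader.fieldnames + ["qc_flag"])
    writer.writeheader()

    # Data start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        flag = ""
        try:
            height = float(row["height_cm"])
        except (TypeError, ValueError):
            flag = "not_numeric"
        else:
            if not 0 < height <= max_height_cm:
                flag = "out_of_range"

        if flag:
            # The log is the record of what was flagged and why.
            log_file.write(
                f"line {line_no}: height_cm={row['height_cm']!r} flagged {flag}\n"
            )
        row["qc_flag"] = flag
        writer.writerow(row)


# Stand-ins for raw.csv, clean.csv, and a change log.
raw = io.StringIO("plot,height_cm\nA,152.5\nB,-4\nC,tall\n")
clean, log = io.StringIO(), io.StringIO()
run_qaqc(raw, clean, log)
```

With real files, the raw .csv would be opened read-only and the flagged copy written under a different name, so rerunning the script always starts from the same untouched original.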

4 Strategies:

•  Prevent

•  Minimize

•  Detect

•  Handle

From Flickr by Elliott Teel

Website: carlystrasser.net

Email: carlystrasser@gmail.com

Twitter: @carlystrasser

Slides: slideshare.net/carlystrasser