A basic course on Research data management, part 4: caring for your data, or making data reusable

PROOF course Information Literacy and Research Data Management

TU/e, 24-01-2017

[email protected], TU/e IEC/Library

Available under CC BY-SA license, which permits copying and redistributing the material in any medium or format & adapting the material for any purpose, provided the original author and source are credited & you distribute the adapted material under the same license as the original

http://w3.tue.nl/en/services/library/

http://creativecommons.org/licenses/by-sa/4.0/

Research data management: what was it again?

Sharing your data, or making your data findable and accessible with good data practices
+ protecting your data: back up, access control; file naming, organizing data, versioning
+ sharing your data via collaboration platforms and archives

→ Caring for your data, or making your data reusable and interoperable with good data practices
+ metadata, tidy data, licenses

Before data can be reusable, it first has to be usable

What is the nature of the “unusual episode” to which this table refers?

Raw data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt

Documentation accompanying the data:

https://www.amstat.org/publications/jse/datasets/titanic.txt

Size (number of observations and variables)

Description

Provenance

Variable descriptions

Based on:

The "Unusual Episode" Data Revisited / by Robert J. MacG. Dawson, in: Journal of Statistics Education vol. 3(1995), issue 3


https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html



1. Morphological Measurements of Galapagos Finches

http://dx.doi.org/10.5061/dryad.152

Use of standard names (taxonomy, species)

Variable names clear enough? WingL must be wing length but what is N.Ubkl?

Units of measurement?

Based on:

Looking after datasets / by Antony Unwin, 01-09-2015, http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html

2. Collaborative FAIR data sharing / by Henry Rzepa


http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292

The welfare consequences… / by Jonathan J. Cooper et al., http://dx.doi.org/10.1371/journal.pone.0102722

Word.doc

These data are findable and accessible – but usable?

Lessons learned: table structure [ tidy data ]

To allow your data to be easily imported by data management systems, analyzed by analysis software, and combined with other data (interoperability), make sure that:
+ each row represents a single observation (record)
+ each column represents a single variable (parameter) or type of measurement (field)
+ every cell contains only one piece of information (no highlighting of cells)
+ there is only one table for each type of information (no multiple worksheets)

Cross-tab structure / contingency table: different columns contain measurements of the same variable: easier to read but difficult to add data (columns) to the records (rows). See Titanic table versus Titanic raw data

“The problem is that people like to view data in a totally different way than a computer likes to process it.” (Kien Leong)

http://vita.had.co.nz/papers/tidy-data.pdf
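The difference between a cross-tab and a tidy table can be sketched in a few lines of Python. This is a minimal illustration, assuming the table has already been read into a list of dictionaries. The function name `crosstab_to_tidy`, the column names and the counts are made up for this example and are not taken from the Titanic data.

```python
def crosstab_to_tidy(rows, id_column, variable_name, value_name):
    """Turn a cross-tab (one column per category) into tidy rows.

    Each output row holds exactly one observation:
    the identifier, the category, and the value.
    """
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


# Made-up cross-tab: the columns "yes" and "no" both hold counts,
# so the same variable is spread over two columns.
crosstab = [
    {"group": "A", "yes": 3, "no": 1},
    {"group": "B", "yes": 2, "no": 4},
]

tidy = crosstab_to_tidy(crosstab, "group", "outcome", "count")
for record in tidy:
    print(record)
```

In the tidy form a new category becomes extra rows instead of an extra column, which is what makes adding or combining data easier.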

Lessons learned: table metadata: variables (columns) and observations/records (rows)

+ include a row at the top of each table that contains full column (variable) names (no hard-to-understand abbreviations)
+ columns: use clear, descriptive variable names, avoid special characters (can cause problems with some software)
+ rows: if possible, use standard names within cells (derived from a taxonomy for example, standard species name, standard date formats, …)
+ try to avoid coding categorical or ordinal data as numbers
+ missing data / null values: best option: use a blank
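Two of these points can be checked with a short script. The sketch below is only an illustration: the rule for a "safe" column name (a letter followed by letters, digits or underscores) is an assumption, not a standard, and the measurement values are made up. Note that such a check only catches special characters; whether a name like WingL is understandable still needs a human reader.

```python
import csv
import io
import re

# Assumed rule for a "safe" name: starts with a letter,
# then only letters, digits or underscores.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def problem_columns(header):
    """Return the column names that contain special characters."""
    return [name for name in header if not SAFE_NAME.match(name)]


def read_with_missing(csv_text):
    """Read CSV text and represent blank cells as None (missing)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


# "N.Ubkl" contains a dot, so it is reported.
flagged = problem_columns(["species", "WingL", "N.Ubkl"])
print(flagged)

# Made-up measurements; the second wing length is left blank.
sample = "id,wing_length_mm\n1,70.5\n2,\n"
records = read_with_missing(sample)
print(records)
```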

Lessons learned: data set metadata (documentation), discovery metadata, licenses

+ size of the data set: number of observations and variables
+ explanation of the variables, how each was measured and its measurement units (code book)
+ provenance (origin) of the data, how you collected the data, data manipulation steps (study design)
+ description of the data set: what’s included and excluded, known problems or inconsistencies in the data, why data are missing
+ add license information: what are others allowed to do with your data?
+ a simple readme file can be enough (see documentation Titanic dataset) but not always

“Research outputs that are poorly documented are like canned goods with the label removed (…)” (Carly Strasser)

http://dx.doi.org/10.17026/dans-xk5-y7vc
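Part of a readme file can be generated from the data itself. The sketch below is one possible starting point, not a prescribed template: the function name `readme_skeleton`, the field labels and the sample values are assumptions made for this example. It fills in the size of the data set and lists the variables, and leaves TODO markers for everything only the researcher can describe.

```python
import csv
import io


def readme_skeleton(csv_text, title, license_name):
    """Build a plain-text readme skeleton for a CSV data set."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Count data rows, ignoring completely empty lines.
    n_rows = sum(1 for row in reader if row)
    lines = [
        f"Title: {title}",
        f"Size: {n_rows} observations, {len(header)} variables",
        "Provenance: TODO (origin, how the data were collected, manipulation steps)",
        "Description: TODO (what is included and excluded, known problems, why data are missing)",
        f"License: {license_name}",
        "Variables:",
    ]
    lines += [f"  {name}: TODO (description, how measured, unit)" for name in header]
    return "\n".join(lines)


# Made-up sample data, for illustration only.
sample = "id,wing_length_mm\n1,70.5\n2,\n"
readme = readme_skeleton(sample, "Example measurements", "CC BY-SA 4.0")
print(readme)
```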

Lessons learned: long-term availability

+ if possible, use a non-proprietary (open) file format, like csv for tabular data (open formats are easier to use in a variety of software)
+ if possible, take the preferred formats of a data archive into account. See for example the 4TU.ResearchData overview of file formats and types of support: http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
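Saving tabular data as csv needs nothing more than the standard library of a scripting language. The following Python sketch writes made-up rows to a UTF-8 csv file with a header row and reads them back; the helper names `write_csv` and `read_csv` and the file name are assumptions for this example.

```python
import csv
import os
import tempfile


def write_csv(rows, path):
    """Write a list of dictionaries to a UTF-8 CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


# Made-up rows; the blank cell marks a missing value.
rows = [
    {"id": "1", "wing_length_mm": "70.5"},
    {"id": "2", "wing_length_mm": ""},
]
path = os.path.join(tempfile.mkdtemp(), "measurements.csv")
write_csv(rows, path)
round_trip = read_csv(path)
print(round_trip)
```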

Tools for working with messy data

Excel vs scripting-based software tools
+ Excel: data provenance and documentation of data processing with a graphical user interface is bad because it doesn’t leave a record
+ use a scripted language (R (free), Matlab, SAS…) to process data, run the analysis and to produce final outputs

OpenRefine
+ runs on your computer (not in the cloud), inside the Firefox browser (not in IE), no web connection is needed
+ working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology/01-working-with-openrefine.html
+ captures all steps done to your raw data; original dataset is not modified; steps are easily reversed

Tabula
+ “… tool for liberating data tables locked inside PDF files.”

A reproducible workflow (bartomeuslab)

http://production-scheduling.com/seven-deadly-spreadsheet-sins/

https://www.r-project.org/

http://openrefine.org/

http://www.datacarpentry.org/OpenRefine-ecology/01-working-with-openrefine.html

http://tabula.technology/

https://youtu.be/s3JldKoA0zw
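What it means for a script to leave a record can be shown with a small sketch. This is an illustration under assumptions, not how OpenRefine or any particular tool works internally: the step functions, the `run_pipeline` helper and the sample rows are all made up. The raw data are copied before processing, every step is a named function, and the log records what each step did.

```python
import copy


def strip_whitespace(rows):
    """Remove leading and trailing spaces from every cell."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_empty_rows(rows):
    """Drop rows in which every cell is blank."""
    return [row for row in rows if any(value != "" for value in row.values())]


def run_pipeline(raw_rows, steps):
    """Apply the steps in order to a copy of the raw data and log each one."""
    rows = copy.deepcopy(raw_rows)  # the raw data stay untouched
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append(f"{step.__name__}: {before} -> {len(rows)} rows")
    return rows, log


# Made-up raw data with stray spaces and an empty row.
raw_rows = [
    {"name": " a ", "value": "1"},
    {"name": "", "value": ""},
]

cleaned, log = run_pipeline(raw_rows, [strip_whitespace, drop_empty_rows])
print(cleaned)
print(log)
```

Because the steps are listed in the script, rerunning it on the raw data reproduces the cleaned data, and removing a step from the list undoes it.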

Support

Data Coach [ website ]

TU/e data librarians ([email protected])

Leon Osinski, Sjef Öllers

Recommended reading
+ Van den Eynden, Veerle et al. (2011), Managing and sharing data: best practice for researchers, UK Data Archive
+ Strasser, Carly (2015), Research data management, NISO

Recommended online course
+ Essentials 4 data support [English & Dutch]

http://www.tue.nl/datacoach

mailto:[email protected]

http://www.data-archive.ac.uk/media/2894/managingsharing.pdf

http://www.niso.org/apps/group_public/download.php/15375/PrimerRDM-2015-0727.pdf

http://datasupport.researchdata.nl/en/

URLs of mentioned webpages in order of appearance #1

1. Overview research data storage services: http://dataservices.silk.co/
2. Raw Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
3. Documentation to Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.txt
4. The “Unusual Episode” Data revisited: https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
5. Morphological Measurements of Galapagos Finches: http://dx.doi.org/10.5061/dryad.152
6. Looking after data sets: http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html
7. Collaborative FAIR data sharing: http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292
8. The welfare consequences… : http://dx.doi.org/10.1371/journal.pone.0102722
9. Tidy data: http://vita.had.co.nz/papers/tidy-data.pdf
10. Data guide example: http://dx.doi.org/10.17026/dans-xk5-y7vc
11. Preferred data formats of 4TU.ResearchData: http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
12. Excel: http://production-scheduling.com/seven-deadly-spreadsheet-sins/
13. R: https://www.r-project.org/

URLs of mentioned webpages in order of appearance #2

14. Bartomeuslab, A reproducible workflow: https://youtu.be/s3JldKoA0zw
15. OpenRefine: http://openrefine.org/
16. Working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology/01-working-with-openrefine.html
17. TU/e Data Coach: http://www.tue.nl/datacoach
18. Carly Strasser, Research data management: http://www.niso.org/apps/group_public/download.php/15375/PrimerRDM-2015-0727.pdf
19. Online course ‘Essentials for data support’: http://datasupport.researchdata.nl/en/
