Responsible conduct of research: Data Management

Post on 13-Apr-2017



Responsible Conduct of Research: Managing Data

Tobin Magle, Data Management Specialist

Nicole Kaplan, Information Manager

Daniel Draper, Digital Repositories Unit Coordinator

Responsible Conduct of Research: The data management firehose!

C. Tobin Magle, PhD

Please ask me for help with data management!

My Background: molecular microbiology

(1) Magle CT, et al. Infect Immun. 2014 Feb;82(2):618-25. doi: 10.1128/IAI.00444-13. Epub 2013 Nov 25.

(2) Sun W, Tanaka TQ, Magle CT, et al. Sci Rep. 2014 Jan 17;4:3743. doi: 10.1038/srep03743.

Data Workshops

Individual help for ANY data topic

• How do I write a DMP?

• How do I organize my data?

• How do I clean and format my data?

• How do I use R?

• How do I get my data ready to share?

• How do I comply with funder mandates?

• What DM tools are there for collaboration?

Data Management Services

https://lib.colostate.edu/services/data-management

What is data management?

The policies, practices, and procedures needed to manage the storage, access, and preservation of data produced from a research project

data management != data sharing

• but the same principles apply to both

Everything* is digital

• Data are ephemeral

• Big, complex data

• CAN share

*ok not everything, but most things

-> Need new skills

More researchers

https://www.nsf.gov/statistics/2016/nsf16300/digest/nsf16300.pdf

See arXiv:1402.4578 for details

[Figure panels: Working email; Response (if email working); Status of data (if response); Data are extant (if status known)]

doi:10.1016/j.cub.2013.11.014

We are losing vast amounts of data


Who is responsible?

CSU data policy

General points:

• The university owns, and is therefore ultimately responsible for, research data

• Researchers are the data managers

• The university promotes openness

http://policylibrary.colostate.edu/policy.aspx?id=737

You’re a Data Manager

http://www.phdcomics.com/comics/archive.php?comicid=382

CSU data policy

Research Data Associated with Theses and Dissertations

To preserve the complete scholarly record of the author, data sets must be incorporated. Therefore, a student depositing their thesis or dissertation is required to make discoverable, accessible and available their associated data sets in accordance with this policy and provisions of the University’s Digital Repository. Access and rights management (embargo period, access limited to specific IP addresses) shall be the same for the associated data sets as it is for the thesis or the dissertation.

http://policylibrary.colostate.edu/policy.aspx?id=737

When should data management happen?

Throughout the whole research cycle

The research cycle

[Diagram, built up over several slides. Stages: Hypothesis, Experimental design, Raw data, Tidy data, Results, Article. Labels added as the diagram builds: Data Management Plans, Cleaning, Analysis, Sharing, Open Data, Code, Reproducible Research, Reuse]

What is research data?

• “The recorded factual material commonly accepted in the scientific community as necessary to validate research findings”

- White House Office of Management and Budget

• Reality: Applies to any research product


Working data vs. archived data

What is a data management plan?

A description of how you plan to describe, preserve and share your research data.

Often required by funding agencies

Successful DMPs include

• A data inventory, including type(s) and size

• A strategy for describing the data

• A plan for preserving the data

• A method for access to the data

Always make sure to follow funder requirements

Tool: DMPTool

• Review requirements from different agencies

• https://dmptool.org/guidance

• Create new DMPs based on funding agency templates

• Search public DMPs

Data inventory

• What type of data are you going to collect?

• What file type will be produced?

• What size will these files be? How many files?

• How will you organize the data?

• What other research outputs will be produced?
  • Code/Software?
  • Templates/protocols?

Data inventory (example)

• What type of data are you going to collect? miRNA sequences

• What file type will be produced? FASTQ files

• What size will these files be? How many files? 1 GB per file x 64 strains x 3 replicates = ~200 GB

• What other research outputs will be produced?
  • Code/Software? R scripts for analysis and visualization
  • Templates/protocols? Data use tutorials
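The size estimate in the example inventory (1 GB per file, 64 strains, 3 replicates) can be worked through in a few lines. This is a minimal sketch that assumes one file per strain per replicate, which the slide implies but does not state; the function name is hypothetical.

```python
def estimate_storage_gb(gb_per_file, n_strains, n_replicates):
    """Estimate total raw-data size, assuming one file per strain per replicate."""
    return gb_per_file * n_strains * n_replicates


# Numbers from the slide: 1 GB per file, 64 strains, 3 replicates.
total_gb = estimate_storage_gb(gb_per_file=1, n_strains=64, n_replicates=3)
print(f"{total_gb} GB")  # 192 GB, which the slide rounds to ~200 GB
```

Rounding up, as the slide does, leaves a little headroom; processed data, code, and backups would add to this figure.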

Data formats

• Avoid proprietary formats

• Know what software can read your data

Proprietary format -> Open format

• Excel (.xls, .xlsx) -> Comma-separated values (.csv)

• Word (.doc, .docx) -> Plain text (.txt)

• PowerPoint (.ppt, .pptx) -> PDF/A (.pdf)

• Photoshop (.psd) -> TIFF (.tif, .tiff)

• QuickTime (.mov) -> MPEG-4 (.mp4)

• MPEG-4 protected audio (.m4p) -> MP3 (.mp3)
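One way to act on the format table is to scan a project folder for the proprietary extensions it lists. The sketch below uses only the Python standard library; the mapping is copied from the slide, while the function name and the idea of scanning by file extension are assumptions made for illustration.

```python
from pathlib import Path

# Proprietary extension -> open alternative, as listed on the slide.
OPEN_ALTERNATIVES = {
    ".xls": ".csv", ".xlsx": ".csv",
    ".doc": ".txt", ".docx": ".txt",
    ".ppt": ".pdf (PDF/A)", ".pptx": ".pdf (PDF/A)",
    ".psd": ".tif",
    ".mov": ".mp4",
    ".m4p": ".mp3",
}


def find_proprietary_files(root):
    """Return (path, suggested_format) pairs for proprietary files under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in OPEN_ALTERNATIVES:
            hits.append((path, OPEN_ALTERNATIVES[suffix]))
    return hits


if __name__ == "__main__":
    # Read-only scan of the current directory; nothing is converted or modified.
    for path, suggestion in find_proprietary_files("."):
        print(f"{path}: consider exporting to {suggestion}")
```

The script only reports candidates. The conversion itself is best done in the software that created the file, so that formulas, formatting, or layers are exported deliberately rather than silently dropped.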

Q’s: Data Inventory

What kind of data are you going to collect?

What file type will be produced?

What size will these files be? How many files?

What other research outputs will be produced?

Folder systems

• Identify ways to divide your data into categories (Attributes)

• Top level organization is the most important attribute

• Provide documentation

Hierarchical Organization

my_thesis
  chapter1
    raw_data
      replicate1
      replicate2
    processed_data
    code
      processing
      cleaning
    results
      tables
      figures
  chapter2
  chapter3
  chapter4
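The hierarchical layout on this slide can be created in one step rather than by hand. The following is a minimal sketch using Python's `pathlib`. The folder names come from the slide; repeating the same subfolders under every chapter is an assumption, and the function name is hypothetical.

```python
from pathlib import Path

# Subfolder names from the slide. Using them under every chapter is an assumption.
SUBFOLDERS = [
    "raw_data/replicate1",
    "raw_data/replicate2",
    "processed_data",
    "code/processing",
    "code/cleaning",
    "results/tables",
    "results/figures",
]


def create_project_skeleton(root, chapters=("chapter1", "chapter2", "chapter3", "chapter4")):
    """Create the chapter/subfolder hierarchy under root and return the root path."""
    root = Path(root)
    for chapter in chapters:
        for sub in SUBFOLDERS:
            # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
            (root / chapter / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `create_project_skeleton("my_thesis")` would build the tree in the current directory. Generating the folders from a single list keeps every chapter organized the same way.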

Q’s: Data Organization

• What kinds of files are there? (See data inventory)

• How could you group them?
  • Project?
  • Time?
  • Location?
  • File type?

• What are the most important attributes?

Tool: Open Science Framework

• Components

• Add-ons

• Contributors

• Wiki

http://help.osf.io/m/collaborating/l/524109-using-the-wiki

http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share

Organization rules

• Be consistent

• One directory per project

• Separate subdirectories for
  • Raw data
  • Processed data
  • Code (processing and analysis)
  • Output

• Make raw data read-only

• Make README files

http://help.osf.io/m/60347/l/611391-organizing-files
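Two of these rules, read-only raw data and README files, are easy to script. The sketch below is one possible approach using the Python standard library; the function names and the `README.txt` filename are assumptions, not part of the original slides.

```python
import stat
from pathlib import Path


def make_read_only(directory):
    """Remove write permission from every file under directory.

    Returns the number of files processed.
    """
    count = 0
    for path in Path(directory).rglob("*"):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for owner, group, and others.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            count += 1
    return count


def write_readme(project_dir, description):
    """Write a plain-text README describing the project folder."""
    readme = Path(project_dir) / "README.txt"
    readme.write_text(description + "\n", encoding="utf-8")
    return readme
```

For example, `make_read_only("my_thesis/chapter1/raw_data")` would protect the raw files from accidental edits, so that all cleaning happens on copies in the processed-data folder. On Windows, `chmod` only toggles the read-only flag, which is sufficient for this purpose.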

Example: Temperature data

A strategy for describing the data

• Metadata: Relevant information for re-creation and re-use

  • Contact info
  • How data were collected
  • Details about collection
  • Date, location of collection
  • Units

• Can be as simple as a text file
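As an illustration of metadata kept in a simple text file, the sketch below writes one "field: value" line per item. The field names follow the bullet list above; every value is a hypothetical placeholder, and the function name is an assumption.

```python
from pathlib import Path


def write_metadata(path, fields):
    """Write one 'field: value' line per metadata item to a plain text file."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(path)


# Hypothetical values for illustration only.
example_fields = {
    "Contact": "Jane Researcher <jane.researcher@example.edu>",
    "Collection method": "Hourly readings from a temperature logger",
    "Collection date": "2017-03-01 to 2017-03-31",
    "Location": "Example field site",
    "Units": "degrees Celsius",
}
```

Calling `write_metadata("metadata.txt", example_fields)` next to the data files would be enough for a future reader to interpret the columns and units without contacting the author.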

Metadata standards

• Dublin Core: http://dublincore.org/documents/dcmi-terms/
  • Can be applied to anything

• Many discipline-specific metadata standards
  • EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
  • MIAME: http://fged.org/projects/miame/

• Search for other standards:
  • http://www.dcc.ac.uk/resources/metadata-standards
  • https://biosharing.org/standards/

Genomics example (NCBI template)

Q’s: Describe your data

What do people need to know to reuse your data?

Are there any discipline-specific metadata standards?

What format will you describe your data in (text, XML, tabular)?

What fields will you include (author, date, format, identifier)?
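To make the format and field questions concrete, here is a minimal sketch of a record with Dublin Core elements written as XML using Python's standard library. The `creator`, `date`, `format`, and `identifier` elements are part of the Dublin Core element set ("author" maps to `creator`). The wrapping `metadata` element, the function name, and all values are assumptions for illustration; a repository may require a different wrapper or schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Return an XML string with one dc:<name> element per entry in fields."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for illustration only.
record = dublin_core_record({
    "creator": "Jane Researcher",
    "date": "2017-04-13",
    "format": "text/csv",
    "identifier": "hypothetical-dataset-001",
})
print(record)
```

A plain text or tabular description can carry the same fields. XML mainly helps when a repository harvests the metadata automatically.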

A plan for preserving the data

• Where will it be stored?
  • Backups

• Necessary metadata and other products

• Who is responsible?

• How long?

Ellin, A. Rutgers Student Offers $1,000 for Data on Stolen Laptop. ABC News via Good Morning America. April 26, 2013. http://abcnews.go.com/blogs/business/2013/04/rutgers-student-offers-1000-for-data-on-stolen-laptop/

Backup

Backup recommendations

• Store in geographically distinct locations

• How often?

• Automation: Will you remember to do it manually?

• Security: Are you working with PHI?
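The automation point can be illustrated with a short script that copies a project folder to several backup locations and verifies each copy by checksum. This is a sketch under stated assumptions: the function names are hypothetical, the backup locations are whatever paths you pass in (ideally drives or mounts in different physical places), and it requires Python 3.8 or later for `dirs_exist_ok`. It does not handle encryption, so it is not sufficient on its own for PHI.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir, backup_dirs):
    """Copy source_dir into each backup location and verify every file."""
    source_dir = Path(source_dir)
    targets = []
    for backup_dir in backup_dirs:
        target = Path(backup_dir) / source_dir.name
        shutil.copytree(source_dir, target, dirs_exist_ok=True)
        for src in source_dir.rglob("*"):
            if src.is_file():
                copy = target / src.relative_to(source_dir)
                if sha256_of(src) != sha256_of(copy):
                    raise IOError(f"Checksum mismatch for {copy}")
        targets.append(target)
    return targets
```

Running such a script from a scheduler (for example cron or Windows Task Scheduler) removes the need to remember manual backups. How often to run it depends on how much work you can afford to lose.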

Q’s: Preservation plan

What will you store?

Who will be responsible for the data (person or position)?

How long will you store it?

Where will you store it?

How will you back it up?

*Differentiate between working and archived data

A method to access the data

• Important to funding agencies
  • Reproduce existing research
  • Promote further research

• Must be easily available:
  • No “by request only”
  • Embargoes are “ok”

• Data security: consider privacy and IP issues before sharing

Data access and sharing best practices

• Non-proprietary formats

• Include metadata

• As open as possible

• Follow CSU research data policy

Trusted Repositories: store and share

• Discipline specific
  • Search: http://service.re3data.org/browse/by-subject/

• Generic
  • Figshare - https://figshare.com/
  • Dryad - http://datadryad.org/

• CSU Digital Repository
  • http://lib.colostate.edu/digital-collections/

http://67.media.tumblr.com/6228cbe58a9652f1a85e8ab1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png

Tool: CSU digital repository

• Over 100 Datasets

• Satisfy requirements for manuscripts and grants

• At no cost <1 TB
  • $150/TB for 5 years
  • $300/TB for >5 years

Thesis and Dissertation Data

1. Submit to ProQuest with thesis or dissertation
  • Supplemental data file
  • Only discoverable through thesis or dissertation

2. Submit to CSU Library separately
  • Requires distinctive descriptive metadata
  • Linked with thesis or dissertation
  • Data discoverable globally

Q’s: Access methods

Where will people be able to access the data?

Does your discipline have a repository?

Are you complying with CSU’s data policy?

How will you format the data for CSU digital repository?

Need help?

• General: library_data@colostate.edu

• Direct: tobin.magle@colostate.edu

• DMPTool: http://dmptool.org/

• Data Management Services website: http://lib.colostate.edu/services/data-management